This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Parquet writes incorrect List<u32> #1368

Closed
ritchie46 opened this issue Jan 18, 2023 · 0 comments · Fixed by #1390
Labels
bug Something isn't working

Comments

Collaborator

ritchie46 commented Jan 18, 2023

At the boundary of 349_526 rows, with 349_525 nulls and only the last value specified, the Parquet file that is written is incorrect.

This also seems to be related to the row group size; see the original issue report: pola-rs/polars#6289

The most minimal example I could make is:

import io
import polars as pl

f = io.BytesIO()
df = pl.Series('a', [*[None]*349_525, [1, 2]], dtype=pl.List(pl.UInt32)).to_frame()
print(df.tail(1))

df.write_parquet(f)
f.seek(0)  # rewind the buffer before reading it back
print(pl.read_parquet(f).tail(1))  # we expect the same `[1, 2]` here, but we get `[null, null]`
shape: (1, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[u32] │
╞═══════════╡
│ [1, 2]    │
└───────────┘
shape: (1, 1)
┌──────────────┐
│ a            │
│ ---          │
│ list[u32]    │
╞══════════════╡
│ [null, null] │
└──────────────┘

The state of the df is:

print(df)
shape: (349526, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[u32] │
╞═══════════╡
│ null      │
│ null      │
│ null      │
│ null      │
│ ...       │
│ null      │
│ null      │
│ null      │
│ [1, 2]    │
└───────────┘

When we use the pyarrow backend for writing, the output is as expected.
