Polars version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of Polars.
Issue description
Filtering for `is_null()` on a categorical column from a file read using `scan_parquet()` incorrectly returns zero rows.

- `df.filter(pl.col("parent").is_null()).collect()` is incorrect.
- `df.filter(~pl.col("parent").is_not_null()).collect()` works correctly.
- `df.with_columns(pl.col("parent").is_null().alias("null")).collect()` is also correct.
- The same filter on a file read with `read_parquet()` returns the expected row.
Reproducible example
```python
import polars as pl

pl.toggle_string_cache(True)

df = pl.DataFrame([
    pl.Series("node", ["1", "2"], dtype=pl.Categorical),
    pl.Series("parent", [None, "2"], dtype=pl.Categorical),
]).lazy()
df.sink_parquet("test.parquet")

print("read_parquet():")
df = pl.read_parquet("test.parquet")
print(df.filter(pl.col("parent").is_null()))  # got 1 row, as expected
print(df.filter(pl.col("parent").is_not_null()))

print("scan_parquet():")
df = pl.scan_parquet("test.parquet")
print(df.filter(pl.col("parent").is_null()).collect())  # got 0 rows, expected 1
print(df.filter(pl.col("parent").is_not_null()).collect())
```
Output:
```
read_parquet():
shape: (1, 2)
┌──────┬────────┐
│ node ┆ parent │
│ ---  ┆ ---    │
│ cat  ┆ cat    │
╞══════╪════════╡
│ 1    ┆ null   │
└──────┴────────┘
shape: (1, 2)
┌──────┬────────┐
│ node ┆ parent │
│ ---  ┆ ---    │
│ cat  ┆ cat    │
╞══════╪════════╡
│ 2    ┆ 2      │
└──────┴────────┘
scan_parquet():
shape: (0, 2)
┌──────┬────────┐
│ node ┆ parent │
│ ---  ┆ ---    │
│ cat  ┆ cat    │
╞══════╪════════╡
└──────┴────────┘
shape: (1, 2)
┌──────┬────────┐
│ node ┆ parent │
│ ---  ┆ ---    │
│ cat  ┆ cat    │
╞══════╪════════╡
│ 2    ┆ 2      │
└──────┴────────┘
```
Expected behavior
The filtered result should be identical between `read_parquet()` and `scan_parquet()`. `is_null() == ~is_not_null()` should always be true.
Installed versions
```
---Version info---
Polars: 0.16.7
Index type: UInt32
Platform: Windows-10-10.0.22621-SP0
Python: 3.11.2 (tags/v3.11.2:878ead1, Feb 7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]
---Optional dependencies---
pyarrow: <not installed>
pandas: <not installed>
numpy: 1.24.2
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: <not installed>
deltalake: <not installed>
matplotlib: <not installed>
```
Got a fix upstream: jorgecarleitao/arrow2#1414