Pyarrow is_null filter not working as expected after loading using deltalake #1496
Comments
Hi, I'd appreciate any insights on whether this is something I can fix on my end. I've also added an example of how to reproduce the error in the issue.
Hi @paul-rohith. It sounds like we need to update the statistics conversion to handle nulls better. We handle the cases where a column is all null or all valid, but don't specify anything when there are mixed values. Lines 645 to 657 in 56dfd25
I suspect we instead need to write guarantees like
or perhaps we need to be using
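To illustrate the three cases mentioned above (all null, all valid, mixed), here is a minimal pure-Python sketch of how per-file null statistics could map to a pruning decision for an `is_null` predicate. This is not the delta-rs code at the referenced lines; `NullState`, `null_state`, and `may_skip_file_for_is_null` are hypothetical names made up for this example, and the file statistics are stand-in values.

```python
from enum import Enum


class NullState(Enum):
    """Hypothetical classification of a column within one data file."""
    ALL_NULL = "all_null"
    ALL_VALID = "all_valid"
    MIXED = "mixed"


def null_state(null_count, num_records):
    """Classify a column from its null count and the file's row count."""
    if null_count == 0:
        return NullState.ALL_VALID
    if null_count == num_records:
        return NullState.ALL_NULL
    # Some rows are null and some are not: the case the thread says is unhandled.
    return NullState.MIXED


def may_skip_file_for_is_null(null_count, num_records):
    """A file can be skipped for `col IS NULL` only if it contains no nulls."""
    return null_state(null_count, num_records) is NullState.ALL_VALID


# Stand-in statistics: (file name, null_count, num_records).
files = [("a", 1, 2), ("b", 2, 2), ("c", 0, 3)]
kept = [name for name, nulls, rows in files
        if not may_skip_file_for_is_null(nulls, rows)]
```

With these stand-in values, the mixed file `a` has to be kept alongside the all-null file `b`; the reported behavior corresponds to `a` being dropped.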
Hi, while I'd love to contribute to fixing the bug, I have no experience with Rust. I'm currently working in Python using the deltalake package.
I'm on vacation this week, so I'll take a look next week. I don't think there's an obvious Python-only patch. But the Rust code itself isn't too hard. It's just calling into some Python code.
Hi, thanks for your time!
After some further testing, I think the bug is actually a little more general than what I initially reported:
Instead of loading the whole table, you can convert it to a dataset and then get a record batch reader from that. That will let you read the data in batches and filter from there.
Won't this still involve loading the entire data, just not in one go?
Yes, you can just read in batches. I suppose you could also just pass predicates that don't involve nullable columns for the time being.
# Description
Fixes issue where predicate pushdown isn't working for null values. This adds tests for both columns and partition columns.
# Related Issue(s)
- closes #1496
Environment
Delta-rs version: 0.10.0
Binding: Python
Environment: The same issue appeared on both Windows 10 and Amazon WorkSpaces
Bug
Explained in the following Stack Overflow question: https://stackoverflow.com/questions/76557635/possible-bug-in-using-pyarrow-is-null-function-with-delta-tables
What happened: The is_null filter only returned rows from partitions where all rows have null values for the column in question, i.e. any rows with null values that belonged to a partition that also had non-null values were not returned. Using pa.dataset.dataset("/path/to/table") works as expected.
What you expected to happen: I expected all rows with null values to be returned by the filter.
How to reproduce it:
I'd expect this code to retrieve the 3 rows with None values but it only retrieves 2 of them - the ones which belong to the same partition ('b').
More details: