Broken filter for newly created delta table #2169
Comments
@Hanspagh can you please check the following things:
- try write_deltalake(engine='rust') since this eliminates pyarrow from the equation (also please share the pyarrow version you use now)
- try deltalake v0.15.1 or v0.15.0
Also, it would really help if you can mimic the structure of the data with fake/sample data so we can try to reproduce. The only logical thing I can think of for now is that the partition expression is incorrect. |
Sure, let me try those suggestions first, then I can try to reproduce the problem with a smaller subset. |
Okay, so this seems to be related to pyarrow, since engine="rust" fixes this. |
So I managed to reproduce this. It only happens with large datasets; 10_485_761 rows seems to be the magic number. I tried pyarrow 15, 13, 12, 10 and 9. With pyarrow 8 the process seems to hang when I try to save a frame this big.
It looks as if the filter overflows and only returns the rows beyond 10_485_760, since we get 1 row for 10_485_761 and 2 rows for 10_485_762. 10_485_760 is also exactly 10 * 1024**2 (1024**2 = 1_048_576).
I hope this helps to figure out what is going on here. Let me know if you want me to provide more details.
import pandas as pd
import pyarrow.compute as pc
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"data": ["B"] * 10_485_760})
write_deltalake("sample.delta", df, mode="overwrite")
dt_broken = DeltaTable("sample.delta")
dt_broken.to_pyarrow_dataset().to_table(filter=(pc.field("data") == "B")).shape
# (10485760, 1)
df = pd.DataFrame({"data": ["B"] * 10_485_761 })
write_deltalake("sample.delta", df, mode="overwrite")
dt_broken = DeltaTable("sample.delta")
dt_broken.to_pyarrow_dataset().to_table(filter=(pc.field("data") == "B")).shape
# (1, 1)
df = pd.DataFrame({"data": ["B"] * 10_485_762 })
write_deltalake("sample.delta", df, mode="overwrite")
dt_broken = DeltaTable("sample.delta")
dt_broken.to_pyarrow_dataset().to_table(filter=(pc.field("data") == "B")).shape
# (2, 1) |
I found the magic number: it comes from the default in delta-rs/python/deltalake/writer.py, Line 181 in 3ded236.
The limit forces pyarrow to split the parquet in two, and it seems like deltalake then ignores all but the last of those split files. |
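To make the arithmetic concrete, here is a minimal sketch in plain Python (no deltalake or pyarrow involved) of how a per-file row cap would partition the row counts from the repro. `split_rows` is a hypothetical helper, and the cap of 10 * 1024 * 1024 rows is an assumption inferred from the threshold observed in this thread, not a value read from the writer source.

```python
def split_rows(total_rows: int, max_rows_per_file: int) -> list[int]:
    """Row count of each output file when rows are written in order and a
    new file is started every time the cap is reached (hypothetical model)."""
    if max_rows_per_file <= 0:
        raise ValueError("max_rows_per_file must be positive")
    full_files, remainder = divmod(total_rows, max_rows_per_file)
    return [max_rows_per_file] * full_files + ([remainder] if remainder else [])


# Assumed cap, inferred from the observed threshold: 10 * 1024 * 1024 == 10_485_760
ASSUMED_CAP = 10 * 1024 * 1024

for n in (10_485_760, 10_485_761, 10_485_762):
    sizes = split_rows(n, ASSUMED_CAP)
    print(n, "->", sizes, "| rows in last file:", sizes[-1])
```

Under that assumption, the last file holds 1 and 2 rows for the two failing cases, which lines up with the (1, 1) and (2, 1) shapes reported in the repro.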
@Hanspagh there seems to be an issue with the creation of the pyarrow.dataset when the data is split across multiple parquet files. I can write tables with v0.15.2 and then read them with v0.15.1 using the pc.field("data") == "B" expression. v0.15.2 gives this fragment expression:
While v0.15.1 gave |
You are right, this is only a problem in 0.15.2. Also, 0.15.2 seems to be printing some debugging info
|
Hmm, but it does not seem to be strictly related to the number of files. This works fine:
df = pd.DataFrame({"data": ["B"] * 10})
write_deltalake("broken.delta", df, max_rows_per_file=2, max_rows_per_group=2, min_rows_per_group=2)
# 5 output files
dt_broken = DeltaTable("/Users/hans.pagh/Downloads/broken.delta")
dt_broken.to_pyarrow_dataset().to_table(filter=(pc.field("data") == "B")).shape
# (10,1) |
@Hanspagh I see the issue: the stats are empty on the add action for one of the files. I will have to check why they are empty now and were not before : ) Edit: |
This seems to be one of the smaller examples where it is broken:
df = pd.DataFrame({"data": ["B"] * 1024 * 33})
write_deltalake("broken.delta", df, max_rows_per_file=1024 * 32, max_rows_per_group=1024 * 16, min_rows_per_group=8 * 1024, mode="overwrite")
This seems fine, so there is something with the max_rows_per_file:
df = pd.DataFrame({"data": ["B"] * 1024 * 33})
write_deltalake("broken.delta", df, max_rows_per_file=1024 * 31, max_rows_per_group=1024 * 16, min_rows_per_group=8 * 1024, mode="overwrite") |
@Hanspagh found the culprit: there seems to be an empty row group in the parquet file. Our function get_file_stats_from_metadata checks whether stats are set for each row group, but in this case the last row group is empty and has no stats set, so no stats get set for the file at all. |
Great find, really great to see this could be solved so fast. :) Unrelated to this, deltalake seems to create more row groups than pyarrow, where the row limit per group is set to 1 million. Is there a specific reason for this? |
@Hanspagh you mean the rust engine? |
No, your default settings for I also suspect that the pyarrow defaults |
@Hanspagh not sure, I think they originated from some defaults Databricks uses with spark-delta. A fix is incoming, btw. |
# Description
For some odd reason the pyarrow parquet writer will leave empty row groups in the parquet file when it hits the max_row limit that's passed. While grabbing the stats we were checking that all row groups had stats added to them, but these empty row groups had no stats, which caused the whole file's add action to get no stats recorded. We now skip empty row groups while gathering the stats to prevent this.
In v0.15.2 we now also evaluate files with no stats mentioned as null. @roeap @rtyler not sure if this is entirely correct as well.
# Related Issue(s)
- closes #2169
---------
Co-authored-by: Will Jones <[email protected]>
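The difference between the old and new stats gathering can be sketched roughly as follows. This is a simplified model in plain Python, not the actual get_file_stats_from_metadata code: `RowGroup`, `file_stats`, and the `skip_empty` flag are hypothetical names, and the row-group layout (two full groups plus one empty trailing group) is only an illustration of the situation described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowGroup:
    """Hypothetical stand-in for one parquet row group's metadata."""
    num_rows: int
    min_value: Optional[str] = None
    max_value: Optional[str] = None

    @property
    def has_stats(self) -> bool:
        return self.min_value is not None and self.max_value is not None


def file_stats(row_groups: list[RowGroup], skip_empty: bool) -> Optional[dict]:
    """Aggregate per-row-group stats into file-level stats.

    Returns None (no stats recorded) if any considered row group lacks stats.
    With skip_empty=True, row groups holding zero rows are ignored first.
    """
    groups = [g for g in row_groups if g.num_rows > 0] if skip_empty else list(row_groups)
    if not groups or not all(g.has_stats for g in groups):
        return None
    return {
        "numRecords": sum(g.num_rows for g in groups),
        "minValues": min(g.min_value for g in groups),
        "maxValues": max(g.max_value for g in groups),
    }


# Illustrative layout: two full row groups followed by an empty one without stats.
groups = [RowGroup(16384, "B", "B"), RowGroup(16384, "B", "B"), RowGroup(0)]

print(file_stats(groups, skip_empty=False))  # None: the empty group blocks all stats
print(file_stats(groups, skip_empty=True))   # stats for the 32768 real rows
```

In this model the strict check loses the stats for the whole file because of one empty row group, while skipping empty row groups keeps them, which mirrors the behaviour change the description talks about.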
Environment
Delta-rs version:
'0.15.2'
Binding:
python
Environment:
Bug
When creating a new delta table from a pandas dataframe, it appears that the filter predicate is broken for some expressions.
What happened:
.to_pandas() and .to_pyarrow_dataset() return no data
What you expected to happen:
The above functions should return the data reflected in the filter predicate
How to reproduce it:
This is a large dataset that I cannot share, but please point me in any directions for how to debug this.
This is how I achieved my current results
Since this is returning partially correct results, I suspect some row_group statistics may be wrong, but then I would assume the calls from pyarrow would also return incorrect results.