Datafusion: unreachable code reached when parsing statistics with missing columns #1374
Comments
Apologies for the issue title - not really sure what to call it here.
Would it be possible for you to provide a repro example, or print the values we are passing into
@roeap I'll try and get a reproducible delta table created today and add it as a test case.
I was able to recreate this in a test and I believe it is related to, or a duplicate of, #1372. When running the test on 2077e6a it errors out, while on 1b01f9f it succeeds. Here is the test case for reference: main...cmackenzie1:delta-rs:cole/issue-1374
Edit: err, hold that thought. I am still seeing this error on some tables of mine. Will report back.
Ok, I think I've narrowed it down and there are two bugs at play, with the first one being solved in #1372. This one happens in delta-rs/rust/src/delta_datafusion.rs, lines 273 to 276 in dde2f7e.
What I have been observing is that sometimes the In this scenario, the table is loaded by a combination of parquet checkpoint + JSON and the field missing in the stats is
SchemaField::new(
    "EdgeStartTimestamp".to_string(),
    SchemaDataType::primitive("timestamp".to_string()),
    true,
    HashMap::new(),
),
I've injected some
I am not sure what the behavior should be in this scenario, but I imagine that being unable to prune the file via statistics would just mean the file is kept for querying.
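The "keep the file when it cannot be pruned" idea above can be sketched as follows. This is a minimal illustration using only the standard library; `ColumnStats` and `must_keep` are hypothetical names, not part of delta-rs or DataFusion, and the predicate is assumed to be a simple equality on an integer column.

```rust
/// Hypothetical per-file statistics for one column.
/// `None` means no statistic was recorded for that bound.
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    min: Option<i64>,
    max: Option<i64>,
}

/// Returns true when the file must be kept for a `column = value` predicate.
fn must_keep(stats: ColumnStats, value: i64) -> bool {
    match (stats.min, stats.max) {
        // Both bounds known: keep the file only if the value can fall inside them.
        (Some(min), Some(max)) => min <= value && value <= max,
        // At least one bound is missing: we cannot prove the file is
        // irrelevant, so the conservative choice is to keep it.
        _ => true,
    }
}
```

With this shape, a file whose stats are absent is never dropped; it is simply scanned like any other candidate.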
Looking into the
// [true, Null, false]
let scalars = vec![
ScalarValue::Boolean(Some(true)),
ScalarValue::Boolean(None),
ScalarValue::Boolean(Some(false)),
];
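To illustrate why a typed null such as `ScalarValue::Boolean(None)` matters, here is a rough sketch with a simplified stand-in enum. `Scalar` and `to_bool_column` are assumptions made for this example and are not DataFusion's `ScalarValue` or its conversion routine; the real code panics where this sketch returns an error.

```rust
/// Simplified stand-in for a dynamically typed scalar (not DataFusion's type).
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Boolean(Option<bool>),
    Int64(Option<i64>),
    /// Untyped null: carries no information about the column type.
    Null,
}

/// Builds a boolean column, rejecting any scalar that is not a `Boolean`.
/// A typed null (`Boolean(None)`) is accepted; an untyped `Null` is not.
fn to_bool_column(scalars: &[Scalar]) -> Result<Vec<Option<bool>>, String> {
    scalars
        .iter()
        .map(|s| match s {
            Scalar::Boolean(v) => Ok(*v),
            other => Err(format!("expected Boolean, found {:?}", other)),
        })
        .collect()
}
```

In this model, `[Boolean(Some(true)), Boolean(None), Boolean(Some(false))]` converts cleanly, while replacing the middle element with `Null` makes the conversion fail because the element no longer matches the column type.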
Arrived at the same conclusion. I guess the |
Wouldn't those be covered by the Also, I think this raises another issue: why are the
SELECT
add.path,
add.stats_parsed.minvalues.edgestarttimestamp,
add.stats_parsed.maxvalues.edgestarttimestamp
FROM read_parquet('./testdata/http_requests/_delta_log/00000000000000000400.checkpoint.parquet')
WHERE add.path = 'date=2023-05-16/part-00000-de989f29-71ff-4906-be26-3c36fb604c6a-c000.snappy.parquet'
| path | edgestarttimestamp | edgestarttimestamp |
|-------------------------------------------------------------------------------------|--------------------|--------------------|
| date=2023-05-16/part-00000-de989f29-71ff-4906-be26-3c36fb604c6a-c000.snappy.parquet | | |
SELECT
min(EdgeStartTimestamp),
max(EdgeStartTimestamp)
FROM read_parquet('./testdata/http_requests/date=2023-05-16/part-00000-de989f29-71ff-4906-be26-3c36fb604c6a-c000.snappy.parquet')
| min(EdgeStartTimestamp) | max(EdgeStartTimestamp) |
|-------------------------|-------------------------|
| 2023-05-16 22:09:26 | 2023-05-16 22:11:04 |
SELECT count(*)
FROM read_parquet('./testdata/http_requests/date=2023-05-16/part-00000-de989f29-71ff-4906-be26-3c36fb604c6a-c000.snappy.parquet')
WHERE EdgeStartTimestamp IS NULL
| count_star() |
|--------------|
| 0 |
I believe the stats_parsed field is always optional, and Delta does not necessarily collect stats for all columns, in which case it would always be null for columns where no stats are collected.
Well, I'm not entirely sure how that would work out exactly. But if the conditions defined tell DataFusion to skip the file (i.e. min and max are NULL and thus no row would match IS NOT NULL), I don't think DataFusion re-validates this by comparing the null count and the number of rows to realize that there are in fact non-null rows in the file.
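The cross-check mentioned above (comparing null count against row count before trusting NULL min/max) could look roughly like this. It is only a sketch of the reasoning, assuming hypothetical `FileCounts` and `can_skip_for_is_not_null` names; it does not describe what DataFusion actually does.

```rust
/// Hypothetical per-file counts; `None` means the statistic was not collected.
struct FileCounts {
    num_rows: Option<u64>,
    null_count: Option<u64>,
}

/// Returns true only when the counts prove that every row is null, so a
/// `column IS NOT NULL` filter cannot match anything in the file.
fn can_skip_for_is_not_null(counts: &FileCounts) -> bool {
    match (counts.num_rows, counts.null_count) {
        // Both counts known: skip only if all rows are null.
        (Some(rows), Some(nulls)) => nulls >= rows,
        // A missing count proves nothing, so the file is not skipped.
        _ => false,
    }
}
```

Under this rule a file with NULL min/max but unknown or partial null counts would still be scanned, which matches the data file queried above: it has zero null rows even though the checkpoint stats are empty.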
Test table `issue_1374` was created by hand to have 2 data files where only one file has the `min_values` for the statistics in the `checkpoint.parquet` file set to null in order to trigger the bug. There is no other significance to the table other than to demonstrate issue delta-io#1374.

```
internal error: entered unreachable code
thread 'test_issue_1374' panicked at 'internal error: entered unreachable code', /Users/cole/.cargo/registry/src/index.crates.io-6f17d22bba15001f/datafusion-common-24.0.0/src/scalar.rs:2472:26
```
# Description
Switch the `get_prune_stats` functions to use `None` to represent `null` instead of `ScalarValue::Null` as `ArrayRef` must be of all the same type.

# Related Issue(s)
- closes #1374

# Documentation
https://github.com/apache/arrow-datafusion/blob/dd5e1dbbfd20539b40ae65acb8883f7e392cba92/datafusion/core/src/physical_optimizer/pruning.rs#L54-L72

---------

Co-authored-by: R. Tyler Croy <[email protected]>
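As a rough picture of the approach in this description, the sketch below collects one minimum per file and uses `None` when a file has no stat for the column, so every entry keeps the same element type. `MinValues` and `min_values` are hypothetical names for this example; the actual `get_prune_stats` functions in delta-rs have a different signature and work on Arrow arrays.

```rust
use std::collections::HashMap;

/// Hypothetical per-file stats: column name -> recorded minimum.
type MinValues = HashMap<String, i64>;

/// Collects one entry per file for `column`.
/// A file without a stat for that column yields `None` rather than a
/// differently typed placeholder, so the result stays homogeneous.
fn min_values(files: &[MinValues], column: &str) -> Vec<Option<i64>> {
    files.iter().map(|f| f.get(column).copied()).collect()
}
```

A table shaped like the `issue_1374` test table (two files, one without a minimum) would then produce a two-element column with one `Some` and one `None`.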
Environment
Delta-rs version: 0.11.0, 0.10.0
Binding: rust
Environment:
Bug
What happened:
Reached the unreachable code at https://github.com/apache/arrow-datafusion/blob/37b2c53f281b9550034e7e69f5acf1ae666a0da7/datafusion/common/src/scalar.rs#L2472 when querying a table with DataFusion. It looks like the issue may have been triggered from delta-rs/rust/src/operations/transaction/state.rs, line 204 in 8a4b2b8.
What you expected to happen:
Query executes successfully and returns the matching results.
How to reproduce it:
More details:
Stack trace