fix: scan partitioned tables with datafusion #1303
All things considered, I think this is fine to land. Are there issues filed with DataFusion for the missing kernels that can be linked in these comments?
Eventually, I think our desired behavior will be to put the partition columns where they are in the Delta table schema.
Looking at the PR for that apache/datafusion#5545 (comment),
Some minor suggestions for comments but otherwise seems good.
ACTION NEEDED: delta-rs follows the Conventional Commits specification. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
This commit adds a unit test demonstrating the issue described in delta-io#1292.
This commit adds a test case to demonstrate that a partitioned table cannot be queried using `>=`, as type coercion fails.
Co-authored-by: Will Jones <[email protected]>
ef7649d to b905626
Description
This PR builds on #1293 and tries to address some of the issues we have seen with scanning
partitioned tables with DataFusion. While all our tests pass (mostly, more on that later),
the fix involves some behaviours that we may or may not want to adopt. Specifically, DataFusion
appends partition columns at the end of the schema fields, while we have been reporting them
as leading columns.
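To make the ordering difference concrete, here is a minimal std-only sketch. It does not use the delta-rs or DataFusion APIs; `partition_columns_last` is a hypothetical helper, and the assumption that partition columns are appended in the order they are listed is mine, not something this PR states.

```rust
/// Hypothetical helper: returns the field names with all partition
/// columns moved to the end, mirroring the "partition columns last"
/// layout described above. Non-partition columns keep their order.
fn partition_columns_last(fields: &[&str], partition_columns: &[&str]) -> Vec<String> {
    let mut ordered: Vec<String> = Vec::new();
    // First, every column that is not a partition column, in schema order.
    for name in fields {
        if !partition_columns.contains(name) {
            ordered.push(name.to_string());
        }
    }
    // Then the partition columns, in the order they were listed (assumption).
    for name in partition_columns {
        ordered.push(name.to_string());
    }
    ordered
}
```

With a table schema of `["year", "id", "value"]` partitioned by `year`, the leading-column layout keeps `year` first, while this helper yields `id`, `value`, `year`.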
Recent DataFusion versions also changed the default so that dictionary encoding of partition columns
is opt-in. My thinking was that for the vast majority of tables, keeping dictionary encoding for
partition columns would be the desired behaviour. (@wjones127, do you have an opinion on that?)
This was also a root cause of, or at least related to, the second failing test.
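As a rough illustration of why dictionary encoding suits partition columns: every row read from a given file shares the same partition value, so the column reduces to one dictionary entry plus a run of identical keys. The sketch below is std-only and does not use Arrow's dictionary arrays; `dictionary_encode` and the `u16` key width are assumptions chosen for the example.

```rust
/// Hypothetical helper: encodes a column of strings as a dictionary of
/// distinct values plus one integer key per row pointing into it.
fn dictionary_encode(values: &[&str]) -> (Vec<String>, Vec<u16>) {
    let mut dictionary: Vec<String> = Vec::new();
    let mut keys: Vec<u16> = Vec::with_capacity(values.len());
    for value in values {
        // Reuse the existing entry if this value was seen before,
        // otherwise append it and use its new index.
        let index = match dictionary.iter().position(|entry| entry == value) {
            Some(index) => index,
            None => {
                dictionary.push(value.to_string());
                dictionary.len() - 1
            }
        };
        keys.push(index as u16);
    }
    (dictionary, keys)
}
```

For a file in partition `year=2023`, three rows encode to a dictionary with the single entry `"2023"` and the keys `[0, 0, 0]`.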
I did have to comment out some cases within our file pruning tests where we create expressions with
nulls, as I have thus far not been able to create an expression that DataFusion is happy with. I'll
keep trying, but I also have some work planned on expression parsing for handling user inputs.
There is already a draft PR open (#1267), which does not contain that yet, but where I plan to
address this.
cc @cmackenzie1
Related Issue(s)