feat: omit unmodified files during merge write #1969
Benchmarks
Logs from benchmarks
@thomasfrederikhoeck Predicates can now be pushed down to the delta scan when a full scan is not required, and the optimizer will attempt to prune files based on the predicate. E.g., if you have a predicate like […] #1958 takes it to the next level by determining the distinct partition values that occur in the source and then pruning the scan based on them.
@Blajda ah, ok. Thank you for explaining! I'm looking forward to trying them out!
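For readers following the exchange above, here is a minimal sketch of the idea behind pruning by the distinct partition values found in the source. It is not the code from #1958 or from this PR: `FileEntry` and `prune_by_source_partitions` are hypothetical names, and a single string partition column is assumed.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a data file tracked by the table
/// (assumption: one string-valued partition column).
#[derive(Debug, Clone, PartialEq)]
struct FileEntry {
    path: String,
    partition_value: String,
}

/// Keep only the files whose partition value occurs in the source.
/// Duplicate source values collapse in the set, so only distinct
/// values drive the pruning.
fn prune_by_source_partitions<'a>(
    files: &'a [FileEntry],
    source_partition_values: &[&str],
) -> Vec<&'a FileEntry> {
    let distinct: HashSet<&str> = source_partition_values.iter().copied().collect();
    files
        .iter()
        .filter(|f| distinct.contains(f.partition_value.as_str()))
        .collect()
}

fn main() {
    let files = vec![
        FileEntry { path: "date=2023-01-01/a.parquet".into(), partition_value: "2023-01-01".into() },
        FileEntry { path: "date=2023-01-02/b.parquet".into(), partition_value: "2023-01-02".into() },
        FileEntry { path: "date=2023-01-03/c.parquet".into(), partition_value: "2023-01-03".into() },
    ];
    // The source only touches two partitions, so one file never needs scanning.
    let kept = prune_by_source_partitions(&files, &["2023-01-01", "2023-01-01", "2023-01-03"]);
    for f in &kept {
        println!("scan {}", f.path);
    }
}
```

In the real optimizer the pruning is expressed as a predicate on the scan rather than as a filter over a file list; the sketch only shows why distinct source values are enough to skip files.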
a9ddc9e
to
9884480
Compare
Looks good!! Left two small comments
@@ -1,5 +1,7 @@
//! Logical Operations for DataFusion

use std::collections::HashSet;
Should we use hashbrown here instead?
I read somewhere it's the default in a newer version in the std library but not sure.
Yah as you called out they are changing the underlying implementation to be hashbrown
Description
Implements a new DataFusion node called MergeBarrier that determines which files have modifications. For files that do not have modifications, a remove action is no longer created.
Related Issue(s)
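To illustrate the behaviour stated in the Description, the following is a rough sketch of the bookkeeping involved, not the MergeBarrier implementation itself. The real node operates inside a DataFusion plan; here `MergedRow`, `files_with_modifications`, and `remove_actions` are hypothetical names, and rows are assumed to carry the path of the file they were read from plus a flag saying whether the merge changed them.

```rust
use std::collections::HashSet;

/// Hypothetical row leaving the merge join.
/// `source_file` is None for rows that exist only in the source (inserts).
struct MergedRow {
    source_file: Option<String>,
    modified: bool,
}

/// Collect the paths of files that contain at least one modified row.
fn files_with_modifications(rows: &[MergedRow]) -> HashSet<String> {
    rows.iter()
        .filter(|r| r.modified)
        .filter_map(|r| r.source_file.clone())
        .collect()
}

/// Return the files that need a remove action: only the scanned files
/// that were actually modified. Unmodified files are left in place.
fn remove_actions(scanned_files: &[&str], rows: &[MergedRow]) -> Vec<String> {
    let touched = files_with_modifications(rows);
    scanned_files
        .iter()
        .filter(|f| touched.contains(**f))
        .map(|f| (*f).to_string())
        .collect()
}

fn main() {
    let rows = vec![
        MergedRow { source_file: Some("a.parquet".into()), modified: true },
        MergedRow { source_file: Some("a.parquet".into()), modified: false },
        MergedRow { source_file: Some("b.parquet".into()), modified: false },
        MergedRow { source_file: None, modified: true }, // insert, no target file
    ];
    // Only a.parquet is rewritten; b.parquet gets no remove action.
    for path in remove_actions(&["a.parquet", "b.parquet"], &rows) {
        println!("remove {}", path);
    }
}
```

The point of the design is that an unmodified file does not have to be rewritten, so neither a remove action nor a replacement file is needed for it.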