feat: more efficient parquet writer and more statistics #1397
@@ -324,40 +318,27 @@ impl PartitionWriter {
    }

    fn write_batch(&mut self, batch: &RecordBatch) -> DeltaResult<()> {
        // copy current cursor bytes so we can recover from failures
        // TODO is copying this something we should be doing?
        let buffer_bytes = self.buffer.to_vec();
This was the offending line causing a lot of data copying.
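To make the cost concrete, here is a minimal sketch that uses a plain `Vec<u8>` as a stand-in for the writer's buffer. The names `Sink`, `write_with_snapshot`, `write_with_truncate`, and `snapshot_cost` are hypothetical and do not come from delta-rs. The truncate-based variant is only one possible way to keep failure recovery without the copy; it is not necessarily what this PR does.

```rust
/// Hypothetical in-memory sink standing in for the writer's buffer.
struct Sink {
    buffer: Vec<u8>,
    /// Running total of bytes duplicated for snapshots.
    bytes_copied: usize,
}

impl Sink {
    fn new() -> Self {
        Sink { buffer: Vec::new(), bytes_copied: 0 }
    }

    /// Snapshot-by-copy: clone the whole buffer before every write so it
    /// can be restored if the write fails. `fail` simulates a failed write.
    fn write_with_snapshot(&mut self, batch: &[u8], fail: bool) -> Result<(), String> {
        let snapshot = self.buffer.to_vec();
        self.bytes_copied += snapshot.len();
        self.buffer.extend_from_slice(batch);
        if fail {
            self.buffer = snapshot;
            return Err("write failed".to_string());
        }
        Ok(())
    }

    /// Alternative (an assumption for illustration): remember only the old
    /// length and truncate back to it on failure, so nothing is copied.
    fn write_with_truncate(&mut self, batch: &[u8], fail: bool) -> Result<(), String> {
        let checkpoint = self.buffer.len();
        self.buffer.extend_from_slice(batch);
        if fail {
            self.buffer.truncate(checkpoint);
            return Err("write failed".to_string());
        }
        Ok(())
    }
}

/// Total bytes copied for snapshots when writing `batches` batches of
/// `batch_len` bytes each with the snapshot-by-copy approach.
fn snapshot_cost(batches: usize, batch_len: usize) -> usize {
    let mut sink = Sink::new();
    let batch = vec![0u8; batch_len];
    for _ in 0..batches {
        sink.write_with_snapshot(&batch, false).unwrap();
    }
    sink.bytes_copied
}
```

In this model the snapshot cost grows quadratically with the number of batches: `snapshot_cost(100, 1024)` copies 5,068,800 bytes in order to write only 102,400 bytes of payload, which is consistent with the slowdown reported for large files.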
@@ -528,7 +698,7 @@ mod tests {
                 ("some_bool", ColumnCountStat::Value(v)) => assert_eq!(100, *v),
                 ("some_string", ColumnCountStat::Value(v)) => assert_eq!(100, *v),
                 ("some_list", ColumnCountStat::Value(v)) => assert_eq!(100, *v),
-                ("some_nested_list", ColumnCountStat::Value(v)) => assert_eq!(0, *v),
+                ("some_nested_list", ColumnCountStat::Value(v)) => assert_eq!(100, *v),
I believe this test was incorrect, given the null values here:
delta-rs/rust/src/writer/stats.rs
Line 885 in 0ee692c
"some_nested_list": [[42], null],
Really looking forward to seeing this in action! Feels like this is a great step forward for our write experience!
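The corrected `some_nested_list` assertion discussed above can be illustrated with a toy model. This sketch assumes that each row looks like `[[42], null]` and that 100 such rows are written; the real test fixture may differ. `NestedList`, `nested_null_count`, and `sample_rows` are hypothetical names, and the code does not reproduce how the Parquet writer actually derives null counts from its metadata.

```rust
/// One row of a hypothetical `some_nested_list` column: a list whose
/// entries are themselves nullable lists of integers.
type NestedList = Vec<Option<Vec<i64>>>;

/// Count the null inner lists across all rows.
fn nested_null_count(rows: &[NestedList]) -> usize {
    rows.iter()
        .map(|row| row.iter().filter(|inner| inner.is_none()).count())
        .sum()
}

/// Build `n` rows that each look like `[[42], null]`.
fn sample_rows(n: usize) -> Vec<NestedList> {
    (0..n).map(|_| vec![Some(vec![42]), None]).collect()
}
```

Under those assumptions, `nested_null_count(&sample_rows(100))` is 100 rather than 0, which matches the direction of the fix to the test.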
rust/src/writer/stats.rs (outdated)
@@ -448,6 +615,8 @@ mod tests {
        let mut null_count_keys = vec!["some_list", "some_nested_list"];
        null_count_keys.extend_from_slice(min_max_keys.as_slice());

        dbg!(&stats);
debug artifact?
}

impl AddAssign for AggregatedStats {
    fn add_assign(&mut self, rhs: Self) {
👍
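For readers unfamiliar with the pattern, the following is a minimal sketch of merging per-chunk statistics through `AddAssign`. The `ColumnStats` struct, its fields, and `merge_all` are hypothetical; the actual `AggregatedStats` type in delta-rs may hold different fields and handle more value types.

```rust
use std::ops::AddAssign;

/// Hypothetical per-column statistics for a single integer column.
#[derive(Debug, Clone, PartialEq)]
struct ColumnStats {
    min: Option<i64>,
    max: Option<i64>,
    null_count: u64,
}

impl AddAssign for ColumnStats {
    /// Fold `rhs` into `self`: keep the smaller min, the larger max,
    /// and add the null counts together.
    fn add_assign(&mut self, rhs: Self) {
        self.min = match (self.min, rhs.min) {
            (Some(a), Some(b)) => Some(a.min(b)),
            // If only one side has a value, keep whichever is present.
            (a, b) => a.or(b),
        };
        self.max = match (self.max, rhs.max) {
            (Some(a), Some(b)) => Some(a.max(b)),
            (a, b) => a.or(b),
        };
        self.null_count += rhs.null_count;
    }
}

/// Merge the statistics of several chunks into one summary.
fn merge_all(parts: Vec<ColumnStats>) -> ColumnStats {
    let mut total = ColumnStats { min: None, max: None, null_count: 0 };
    for part in parts {
        total += part;
    }
    total
}
```

Implementing the operator lets the caller accumulate statistics with `total += part` as each chunk's metadata becomes available, without needing to revisit the underlying data.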
# Description

* Removed the data copies in a tight loop, which were extremely bad for performance when writing files > 100MB.
* Rewrote statistics handling to collect null values from metadata, just like min and max.
* Added support for more types in statistics.

# Related Issue(s)

- closes delta-io#1394
- closes delta-io#1209
- closes delta-io#1208

# Documentation