Rust writer in operations makes a lot of data copies #1394
Comments
I do remember wondering the same thing (hence the TODO 😆). This just got carried over from previous implementations, and I think it originated somewhere in kafka-delta-ingest. I guess back then the exposed write API was much lower level, i.e. users would have to use the writer structs directly. In our codebase here I think we are not retrying anything. I also did a quick dive into what errors we may actually see there, and I think we either have an IO issue (where we do retry in object store) or somehow malformed data (I did not go too deep into the parquet crate). So my vote would be for ripping it out.
Okay. I'm going to refactor that module now.
I just opened #1396, which may be relevant in that context, as it also affects the write path.
# Description

* Removed the data copies in a tight loop, which were extremely bad for performance when writing files > 100MB.
* Rewrote statistics handling to collect null values from metadata, just like min and max.
* Added support for more types in statistics.

# Related Issue(s)

- closes #1394
- closes #1209
- closes #1208
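To illustrate what collecting null counts "just like min and max" could look like, here is a minimal sketch. It assumes the metadata exposes one set of statistics per row group for each column; `ChunkStats` and `merge_stats` are hypothetical names invented for this example and are not the parquet crate's API or the code in the PR.

```rust
/// Hypothetical per-row-group statistics for a single column.
/// This is a stand-in type, not a type from the parquet crate.
#[derive(Debug, Clone, PartialEq)]
struct ChunkStats {
    null_count: u64,
    min: Option<i64>,
    max: Option<i64>,
}

/// Fold per-row-group statistics into file-level statistics.
/// Null counts are summed; min and max are combined, skipping
/// row groups that carry no min/max value.
fn merge_stats(chunks: &[ChunkStats]) -> ChunkStats {
    let empty = ChunkStats { null_count: 0, min: None, max: None };
    chunks.iter().fold(empty, |acc, c| ChunkStats {
        null_count: acc.null_count + c.null_count,
        min: match (acc.min, c.min) {
            (Some(a), Some(b)) => Some(a.min(b)),
            (a, b) => a.or(b),
        },
        max: match (acc.max, c.max) {
            (Some(a), Some(b)) => Some(a.max(b)),
            (a, b) => a.or(b),
        },
    })
}
```

The point of the sketch is only that all three values come from the same metadata pass, so no second scan over the data is needed to count nulls.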
Environment
We keep the Parquet file being written as a `Vec<u8>`. The `write_batch` method seems to clone the vec each time it is called, and the batch size for many operations is quite small, defaulting to 1024 rows.

delta-rs/rust/src/operations/writer.rs, lines 326 to 343 in 930d16e
@roeap Do you remember why we do this? I'm wondering whether we can just rip out this error handling, or if we actually do some sort of retry with this.
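For readers without the referenced lines at hand, the cost of this pattern can be sketched as follows. This is not the code from `writer.rs`; `BufferedWriter` and its methods are hypothetical, and `encode` is a trivial stand-in for the real Parquet encoding step. The sketch only assumes that the buffer is cloned before each write so it can be restored on error.

```rust
/// Hypothetical writer that accumulates an encoded file in memory.
struct BufferedWriter {
    buffer: Vec<u8>,
    /// Bytes copied purely for snapshotting (bookkeeping for the demo).
    bytes_copied: usize,
}

impl BufferedWriter {
    fn new() -> Self {
        Self { buffer: Vec::new(), bytes_copied: 0 }
    }

    /// Snapshot-and-restore: clone the whole buffer before each write.
    /// Every call costs O(current buffer size), so writing n small
    /// batches copies O(n^2) bytes in total.
    fn write_with_snapshot(&mut self, batch: &[u8]) -> Result<(), String> {
        let snapshot = self.buffer.clone();
        self.bytes_copied += snapshot.len();
        match self.encode(batch) {
            Ok(()) => Ok(()),
            Err(e) => {
                // Roll back to the state before the failed write.
                self.buffer = snapshot;
                Err(e)
            }
        }
    }

    /// No snapshot: append and propagate any error to the caller.
    fn write_in_place(&mut self, batch: &[u8]) -> Result<(), String> {
        self.encode(batch)
    }

    /// Stand-in for encoding; rejects empty batches to model a failure.
    fn encode(&mut self, batch: &[u8]) -> Result<(), String> {
        if batch.is_empty() {
            return Err("empty batch".to_string());
        }
        self.buffer.extend_from_slice(batch);
        Ok(())
    }
}
```

With many small batches, the snapshot variant copies a number of bytes that grows quadratically with the file size, while the in-place variant copies nothing extra. If nothing retries after a failed write, the snapshot buys no benefit.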