Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This is a random idea, but it seems like it would be valuable to be able to concatenate parquet files without deserializing to Arrow and re-serializing back to Parquet. I'm not 100% sure that it would be possible, but it does seem like you should, in theory, be able to just copy the row group buffers and then update the offsets within the row group metadata in the footer.
You can only do this if the schemas match, of course.
Describe the solution you'd like
If this is indeed possible, then some function like (apologies, my Rust interface design isn't great yet):
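(Rough sketch only; the function name, signature, and steps below are hypothetical, not an existing parquet crate API.)

```rust
use std::fs::File;
use std::path::Path;

/// Hypothetical: concatenate `inputs` into `output` by copying row-group
/// bytes verbatim and rewriting the footer metadata, without decoding to Arrow.
fn concat_parquet_files(inputs: &[&Path], output: &Path) -> std::io::Result<()> {
    let _out = File::create(output)?;
    for input in inputs {
        let _reader = File::open(input)?;
        // 1. Verify this file's schema matches the first input's schema.
        // 2. Copy each row group's byte range into the output unchanged.
        // 3. Keep its row-group metadata, shifting offsets to the new positions.
    }
    // 4. Write a single merged footer (schema + all row-group metadata).
    todo!("footer rewrite not sketched here")
}
```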
Describe alternatives you've considered
The obvious alternative is to simply read as Arrow, concatenate, and then serialize back, but reading and writing Parquet is famously compute-intensive, so it would be nice if we could avoid that.
Additional context
Concatenating parquet files is a common operation in Delta Lake tables, which may initially write out many small files that later need to be merged for better read performance. See delta-io/delta-rs#98.
This sounds like a good idea to me, and could possibly feed into some sort of story for parallel writing 👍
It is probably worth highlighting, though, that whilst merging parquet files without rewriting the row groups should reduce the IO required to fetch them from object storage (along with any catalog overheads), it likely won't help with the CPU-bound portion of actually decoding the bytes, nor with compression, since the row groups themselves are copied as-is.