Added support to write sidecar #147

jorgecarleitao · 2022-06-07T21:42:22Z

Closes #146

codecov-commenter · 2022-06-07T21:46:17Z

Codecov Report

Merging #147 (fc69c9a) into main (de6039b) will increase coverage by 0.06%.
The diff coverage is 81.81%.

@@            Coverage Diff             @@
##             main     #147      +/-   ##
==========================================
+ Coverage   74.62%   74.69%   +0.06%     
==========================================
  Files          78       78              
  Lines        3630     3639       +9     
==========================================
+ Hits         2709     2718       +9     
  Misses        921      921

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/write/mod.rs | `100.00% <ø> (+25.00%)` | ⬆️ |
| src/write/file.rs | `87.50% <81.81%> (+0.14%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

kylebarron

This looks great! I looked through the PR and everything makes sense.
I might pull this branch to test it, but that is probably not necessary before merging.

kylebarron · 2022-06-07T22:06:12Z

src/write/file.rs

+/// Note: Recall that when combining row groups from [`FileMetaData`], the `file_path` on each
+/// of their column chunks must be updated with their path relative to where they are written to.
+pub fn write_metadata_sidecar<W: Write>(writer: &mut W, metadata: &FileMetaData) -> Result<u64> {
+    let mut len = start_file(writer)?;


Ah good catch! I was able to verify that a _metadata file written with pyarrow had PAR1 as both the first four and the last four bytes.
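
The check described above can be reproduced with the standard library alone. The following is a minimal sketch, not code from this PR: `has_parquet_magic` is a hypothetical helper name, and it only inspects the first and last four bytes without parsing the metadata in between.

```rust
use std::io::{Read, Seek, SeekFrom};

/// Magic bytes expected at both the start and the end of the file.
const PARQUET_MAGIC: [u8; 4] = *b"PAR1";

/// Hypothetical helper: returns `true` if the stream both starts and ends with `PAR1`.
fn has_parquet_magic<R: Read + Seek>(reader: &mut R) -> std::io::Result<bool> {
    let len = reader.seek(SeekFrom::End(0))?;
    // The leading and trailing magic must not overlap, so at least 8 bytes are required.
    if len < 8 {
        return Ok(false);
    }

    // Read the first four bytes.
    let mut head = [0u8; 4];
    reader.seek(SeekFrom::Start(0))?;
    reader.read_exact(&mut head)?;

    // Read the last four bytes.
    let mut tail = [0u8; 4];
    reader.seek(SeekFrom::End(-4))?;
    reader.read_exact(&mut tail)?;

    Ok(head == PARQUET_MAGIC && tail == PARQUET_MAGIC)
}
```

Any `Read + Seek` source works, for example a `std::fs::File` opened on the `_metadata` file or a `std::io::Cursor` over bytes held in memory.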

kylebarron · 2022-06-13T22:01:17Z

src/write/file.rs

+    /// Returns the underlying writer and [`FileMetaData`]
+    /// # Panics
+    /// This function panics if [`Self::end`] has not yet been called
+    pub fn into_inner_and_metadata(self) -> (W, FileMetaData) {


One thing I didn't realize originally (because they're both named FileMetaData) is that this returns the thrift FileMetaData and not the parquet2 FileMetaData.

I have a slight feeling that returning the parquet2 FileMetaData would be cleaner here, so that applications don't have to mix and match the two structs. Thoughts?

jorgecarleitao · 2022-06-14

I agree. I tried that, but it is not easy: FileMetaData is a "read-specialized" struct. It contains column descriptors, derived by traversing the parquet schema tree, that are used across all row groups.

Exposing FileMetaData when writing would require doing the same traversal for the sole purpose of this feature, which I considered wasteful.

In other words, I see the need for 3 structs:

"ThriftFileMetadata" - what we convert from and to bytes

"ReadFileMetadata" - "annotated row groups from traversing the schema in ThriftFileMetadata" + "validated ThriftFileMetadata (e.g. num_rows is usize)"

"WrittenFileMetadata" - "validated ThriftFileMetadata (e.g. num_rows is usize)"

of which "ThriftFileMetadata" should be an implementation detail of this crate. But I do not have a good idea of how to do this. Ideas welcome :)
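
To make the proposed split concrete, here is a rough sketch of how the three structs could relate. Everything in it is hypothetical: the struct names are taken from the list above, the fields are reduced to a few stand-ins, and none of it reflects the crate's actual types.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for the thrift-generated struct: raw wire types only.
#[derive(Debug, Clone, PartialEq)]
pub struct ThriftFileMetadata {
    pub version: i32,
    pub num_rows: i64,
    pub row_group_num_rows: Vec<i64>,
}

/// Hypothetical "written" metadata: validated fields, no schema-derived data.
#[derive(Debug, Clone, PartialEq)]
pub struct WrittenFileMetadata {
    pub version: i32,
    pub num_rows: usize,
    pub row_group_num_rows: Vec<usize>,
}

/// Hypothetical "read" metadata: the validated fields plus annotations
/// obtained by traversing the schema (reduced here to leaf column paths).
#[derive(Debug, Clone, PartialEq)]
pub struct ReadFileMetadata {
    pub written: WrittenFileMetadata,
    pub column_paths: Vec<String>,
}

impl TryFrom<ThriftFileMetadata> for WrittenFileMetadata {
    type Error = String;

    /// Validation step: negative counts cannot be represented as `usize`.
    fn try_from(raw: ThriftFileMetadata) -> Result<Self, Self::Error> {
        let num_rows = usize::try_from(raw.num_rows)
            .map_err(|_| format!("invalid num_rows: {}", raw.num_rows))?;
        let row_group_num_rows = raw
            .row_group_num_rows
            .iter()
            .map(|&n| {
                usize::try_from(n).map_err(|_| format!("invalid row group num_rows: {}", n))
            })
            .collect::<Result<Vec<_>, _>>()?;
        Ok(Self {
            version: raw.version,
            num_rows,
            row_group_num_rows,
        })
    }
}

impl ReadFileMetadata {
    /// Annotates already-validated metadata with schema-derived column paths.
    pub fn new(written: WrittenFileMetadata, column_paths: Vec<String>) -> Self {
        Self { written, column_paths }
    }
}
```

In this layout only the read path pays for the schema traversal, while the write path stops at the validated struct and the thrift type never has to appear in the public API.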

jorgecarleitao · 2022-06-15

Note that you can convert between the two using FileMetaData::try_from_thrift, which performs the conversion along with some validation.

jorgecarleitao added the feature A new feature label Jun 7, 2022

jorgecarleitao mentioned this pull request Jun 7, 2022

Collect RowGroupMetaData when writing Parquet dataset for writing _metadata sidecar #146

Closed

jorgecarleitao force-pushed the sidecar branch from 6162bb9 to f7fe0bd Compare June 7, 2022 21:47

Added API to write sidecar

fc69c9a

jorgecarleitao force-pushed the sidecar branch from f7fe0bd to fc69c9a Compare June 7, 2022 21:51

kylebarron approved these changes Jun 7, 2022

jorgecarleitao merged commit 8027600 into main Jun 8, 2022

jorgecarleitao changed the title ~~Added API to write sidecar~~ Added support to write sidecar Jun 8, 2022

jorgecarleitao deleted the sidecar branch June 10, 2022 04:59

kylebarron reviewed Jun 13, 2022

dantengsky mentioned this pull request Jun 20, 2022

Improvement: abandon internal patches of parquet2 databendlabs/databend#6064

Closed
