Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support to write sidecar #147

Merged
merged 1 commit into from
Jun 8, 2022
Merged

Added support to write sidecar #147

merged 1 commit into from
Jun 8, 2022

Conversation

jorgecarleitao
Copy link
Owner

@jorgecarleitao jorgecarleitao commented Jun 7, 2022

Closes #146

@codecov-commenter
Copy link

codecov-commenter commented Jun 7, 2022

Codecov Report

Merging #147 (fc69c9a) into main (de6039b) will increase coverage by 0.06%.
The diff coverage is 81.81%.

@@            Coverage Diff             @@
##             main     #147      +/-   ##
==========================================
+ Coverage   74.62%   74.69%   +0.06%     
==========================================
  Files          78       78              
  Lines        3630     3639       +9     
==========================================
+ Hits         2709     2718       +9     
  Misses        921      921              
Impacted Files Coverage Δ
src/write/mod.rs 100.00% <ø> (+25.00%) ⬆️
src/write/file.rs 87.50% <81.81%> (+0.14%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de6039b...fc69c9a. Read the comment docs.

Copy link
Contributor

@kylebarron kylebarron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I looked through the PR and everything makes sense.
I might pull this branch to test on it, but probably not necessary before merge.

/// Note: Recall that when combining row groups from [`FileMetaData`], the `file_path` on each
/// of their column chunks must be updated with their path relative to where they are written to.
pub fn write_metadata_sidecar<W: Write>(writer: &mut W, metadata: &FileMetaData) -> Result<u64> {
let mut len = start_file(writer)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good catch! I was able to verify that a _metadata file written with pyarrow had PAR1 as both the first four and the last four bytes.

@jorgecarleitao jorgecarleitao merged commit 8027600 into main Jun 8, 2022
@jorgecarleitao jorgecarleitao changed the title Added API to write sidecar Added support to write sidecar Jun 8, 2022
@jorgecarleitao jorgecarleitao deleted the sidecar branch June 10, 2022 04:59
/// Returns the underlying writer and [`FileMetaData`]
/// # Panics
/// This function panics if [`Self::end`] has not yet been called
pub fn into_inner_and_metadata(self) -> (W, FileMetaData) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I didn't realize originally (because they're both named FileMetaData) is that this returns the thrift FileMetaData and not the parquet2 FileMetaData.

I have a slight feeling that returning the parquet2 FileMetaData would be cleaner here, so that applications don't have to mix and match the two structs. Thoughts?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I tried that, but it is not easy - FileMetadata is a "read-specialized" struct - it contains column descriptors derived (transversed from the parquet schema tree) from the parquet schema that are used across all row groups.

To expose FileMetadata when writing requires doing the same for the sole purpose of this feature, which I considered wasteful.

In other words, I see the need for 3 structs:

  • "ThriftFileMetadata" - what we convert from and to bytes
  • "ReadFileMetadata" - "annotated row groups from transversing the schema in ThriftFileMetadata" + "validated ThriftFileMetadata (e.g. num_rows is usize)"
  • "WrittenFileMetadata" - "validated ThriftFileMetadata (e.g. num_rows is usize)"

from which "ThriftFileMetadata" should be an implementation detail of this crate. But I do not have good idea how to do this. Ideas welcome :)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that you can perform the conversion using FileMetadata::try_from_thrift, which performs the conversion and a some validation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature A new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Collect RowGroupMetaData when writing Parquet dataset for writing _metadata sidecar
3 participants