-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support to write sidecar #147
Conversation
Codecov Report
@@ Coverage Diff @@
## main #147 +/- ##
==========================================
+ Coverage 74.62% 74.69% +0.06%
==========================================
Files 78 78
Lines 3630 3639 +9
==========================================
+ Hits 2709 2718 +9
Misses 921 921
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! I looked through the PR and everything makes sense.
I might pull this branch to test on it, but probably not necessary before merge.
/// Note: Recall that when combining row groups from [`FileMetaData`], the `file_path` on each | ||
/// of their column chunks must be updated with their path relative to where they are written to. | ||
pub fn write_metadata_sidecar<W: Write>(writer: &mut W, metadata: &FileMetaData) -> Result<u64> { | ||
let mut len = start_file(writer)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good catch! I was able to verify that a _metadata
file written with pyarrow had PAR1
as both the first four and the last four bytes.
/// Returns the underlying writer and [`FileMetaData`] | ||
/// # Panics | ||
/// This function panics if [`Self::end`] has not yet been called | ||
pub fn into_inner_and_metadata(self) -> (W, FileMetaData) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I didn't realize originally (because they're both named FileMetaData
) is that this returns the thrift FileMetaData
and not the parquet2
FileMetaData
.
I have a slight feeling that returning the parquet2
FileMetaData
would be cleaner here, so that applications don't have to mix and match the two structs. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I tried that, but it is not easy - FileMetadata
is a "read-specialized" struct - it contains column descriptors derived (transversed from the parquet schema tree) from the parquet schema that are used across all row groups.
To expose FileMetadata
when writing requires doing the same for the sole purpose of this feature, which I considered wasteful.
In other words, I see the need for 3 structs:
- "ThriftFileMetadata" - what we convert from and to bytes
- "ReadFileMetadata" - "annotated row groups from transversing the schema in ThriftFileMetadata" + "validated ThriftFileMetadata (e.g. num_rows is
usize
)" - "WrittenFileMetadata" - "validated ThriftFileMetadata (e.g. num_rows is
usize
)"
from which "ThriftFileMetadata" should be an implementation detail of this crate. But I do not have good idea how to do this. Ideas welcome :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that you can perform the conversion using FileMetadata::try_from_thrift
, which performs the conversion and a some validation
Closes #146