-
Notifications
You must be signed in to change notification settings - Fork 853
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve parquet reading performance for columns with nulls by preserving bitmask when possible (#1037) #1054
Changes from 1 commit
94d66ad
3ab4125
b51fdd6
55c2f6f
59846eb
b001f11
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -41,6 +41,10 @@ enum BufferInner { | |||||||||||
max_level: i16, | ||||||||||||
}, | ||||||||||||
/// Only compute null bitmask - requires max level to be 1 | ||||||||||||
tustvold marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||
/// | ||||||||||||
/// This is an optimisation for the common case of a nullable scalar column, as decoding | ||||||||||||
/// the definition level data is only required when decoding nested structures | ||||||||||||
/// | ||||||||||||
Mask { nulls: BooleanBufferBuilder }, | ||||||||||||
} | ||||||||||||
|
||||||||||||
|
@@ -228,6 +232,20 @@ impl ColumnLevelDecoder for DefinitionLevelDecoder { | |||||||||||
} | ||||||||||||
} | ||||||||||||
|
||||||||||||
/// An optimized decoder for decoding [RLE] and [BIT_PACKED] data with a bit width of 1 | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 👍 |
||||||||||||
/// directly into a bitmask | ||||||||||||
/// | ||||||||||||
/// This is significantly faster than decoding the data into `[i16]` and then computing | ||||||||||||
/// a bitmask from this, as not only can it skip this buffer allocation and construction, | ||||||||||||
/// but it can exploit properties of the encoded data to reduce work further | ||||||||||||
/// | ||||||||||||
/// In particular: | ||||||||||||
/// | ||||||||||||
/// * Packed runs are already bitmask encoded and can simply be appended | ||||||||||||
/// * Runs of 1 or 0 bits can be efficiently appended with byte (or larger) operations | ||||||||||||
/// | ||||||||||||
/// [RLE]: https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3 | ||||||||||||
/// [BIT_PACKED]: https://github.com/apache/parquet-format/blob/master/Encodings.md#bit-packed-deprecated-bit_packed--4 | ||||||||||||
struct PackedDecoder { | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This code looks quite similar to I wonder if you looked at possibly reusing that implmentation? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The short answer is not using that implementation is the major reason this PR represents a non-trivial speed bump, it can decode more optimally as it can decode directly using append_packed_range / append_n. Will add some comments clarifying There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am trying to leave breadcrumbs for the next person to look at this code. Is this a correct description of what this structure implements? |
||||||||||||
data: ByteBufferPtr, | ||||||||||||
data_offset: usize, | ||||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the use of
null_mask_only
here -- I thoughtnull_mask_only
would be set only ifmax_def_level() ==
)AKA https://github.com/apache/arrow-rs/pull/1054/files#diff-0d6bed48d78c5a2472b7680a8185cabdc0bd259d6484e184439ed7830060661fR1374
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment clarifying, its an edge case of nested nullability. Perhaps I should add an explicit test 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test added in 59846eb