
Commit

More docs
tustvold committed Jan 12, 2022
1 parent b51fdd6 commit 55c2f6f
Showing 2 changed files with 34 additions and 0 deletions.
16 changes: 16 additions & 0 deletions parquet/src/arrow/record_reader.rs
@@ -83,6 +83,22 @@ where
/// If `null_mask_only` is true only the null bitmask will be generated and
/// [`Self::consume_def_levels`] and [`Self::consume_rep_levels`] will always return `None`
///
/// It is insufficient to solely check that the max definition level is 1, as we
/// need there to be no nullable parent array that will require decoded definition levels
///
/// In particular consider the case of:
///
/// ```ignore
/// message nested {
/// OPTIONAL Group group {
/// REQUIRED INT32 leaf;
/// }
/// }
/// ```
///
/// The maximum definition level of `leaf` is 1; however, we still need to decode the
/// definition levels so that the parent group can be constructed correctly
///
pub(crate) fn new_with_options(desc: ColumnDescPtr, null_mask_only: bool) -> Self {
let def_levels = (desc.max_def_level() > 0)
.then(|| DefinitionLevelBuffer::new(&desc, null_mask_only));
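The doc comment above argues that a maximum definition level of 1 does not by itself justify the mask-only path. A minimal sketch of that reasoning follows. It is not the crate's code: `Repetition`, `max_def_level` and `can_use_null_mask_only` are hypothetical names for illustration, a column is modelled as the list of repetitions from the outermost group to the leaf, and REPEATED fields are left out entirely.

```rust
/// Hypothetical, simplified repetition of one node on a schema path.
/// REPEATED is deliberately omitted from this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Repetition {
    Required,
    Optional,
}

/// Every OPTIONAL node on the path (root to leaf) adds one definition level.
fn max_def_level(path: &[Repetition]) -> i16 {
    path.iter()
        .filter(|r| **r == Repetition::Optional)
        .count() as i16
}

/// Assumed rule for this sketch: the mask-only shortcut is sound only when
/// the single definition level belongs to the leaf itself, so no nullable
/// parent needs the decoded levels.
fn can_use_null_mask_only(path: &[Repetition]) -> bool {
    max_def_level(path) == 1 && path.last() == Some(&Repetition::Optional)
}
```

Under these assumptions a nullable scalar column, `[Optional]`, and the nested example from the doc comment, `[Optional, Required]`, both have a maximum definition level of 1, but only the first passes `can_use_null_mask_only`.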
18 changes: 18 additions & 0 deletions parquet/src/arrow/record_reader/definition_levels.rs
@@ -41,6 +41,10 @@ enum BufferInner {
max_level: i16,
},
/// Only compute null bitmask - requires max level to be 1
///
/// This is an optimisation for the common case of a nullable scalar column, as decoding
/// the definition level data is only required when decoding nested structures
///
Mask { nulls: BooleanBufferBuilder },
}

@@ -228,6 +232,20 @@ impl ColumnLevelDecoder for DefinitionLevelDecoder {
}
}

/// An optimized decoder for decoding [RLE] and [BIT_PACKED] data with a bit width of 1
/// directly into a bitmask
///
/// This is significantly faster than decoding the data into `[i16]` and then computing
/// a bitmask from this, as not only can it skip this buffer allocation and construction,
/// but it can exploit properties of the encoded data to reduce work further
///
/// In particular:
///
/// * Packed runs are already bitmask encoded and can simply be appended
/// * Runs of 1 or 0 bits can be efficiently appended with byte (or larger) operations
///
/// [RLE]: https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3
/// [BIT_PACKED]: https://github.com/apache/parquet-format/blob/master/Encodings.md#bit-packed-deprecated-bit_packed--4
struct PackedDecoder {
data: ByteBufferPtr,
data_offset: usize,
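The two properties listed in the `PackedDecoder` doc comment can be illustrated with a small standalone sketch. This is not the implementation in the commit: `BitMask`, `read_uleb128` and `decode_to_mask` are hypothetical names, the input is assumed to be bare RLE/bit-packed hybrid run data with a bit width of 1 and no length prefix, and malformed or truncated input will simply panic.

```rust
/// Hypothetical LSB-first bitmask; stands in for a real bitmask builder.
#[derive(Default)]
struct BitMask {
    bytes: Vec<u8>,
    len: usize, // number of valid bits
}

impl BitMask {
    fn append_bit(&mut self, bit: bool) {
        if self.len % 8 == 0 {
            self.bytes.push(0);
        }
        if bit {
            *self.bytes.last_mut().unwrap() |= 1 << (self.len % 8);
        }
        self.len += 1;
    }

    /// Append `count` copies of `bit`, using whole-byte fills where aligned.
    fn append_run(&mut self, bit: bool, mut count: usize) {
        while count > 0 && self.len % 8 != 0 {
            self.append_bit(bit);
            count -= 1;
        }
        let fill: u8 = if bit { 0xFF } else { 0x00 };
        self.bytes.resize(self.bytes.len() + count / 8, fill);
        self.len += (count / 8) * 8;
        for _ in 0..count % 8 {
            self.append_bit(bit);
        }
    }

    /// Packed runs are already a bitmask: copy the bytes when aligned,
    /// otherwise fall back to bit-by-bit appends.
    fn append_packed(&mut self, packed: &[u8]) {
        if self.len % 8 == 0 {
            self.bytes.extend_from_slice(packed);
            self.len += packed.len() * 8;
        } else {
            for byte in packed {
                for i in 0..8 {
                    self.append_bit(((byte >> i) & 1) == 1);
                }
            }
        }
    }

    /// Drop padding bits beyond `len` (packed runs come in groups of 8).
    fn truncate(&mut self, len: usize) {
        if len >= self.len {
            return;
        }
        self.len = len;
        self.bytes.truncate((len + 7) / 8);
        if len % 8 != 0 {
            *self.bytes.last_mut().unwrap() &= (1u8 << (len % 8)) - 1;
        }
    }
}

/// Read an unsigned LEB128 varint (the run header); panics on truncated input.
fn read_uleb128(data: &[u8], offset: &mut usize) -> u64 {
    let (mut result, mut shift) = (0u64, 0);
    loop {
        let byte = data[*offset];
        *offset += 1;
        result |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return result;
        }
        shift += 7;
    }
}

/// Decode hybrid-encoded data with bit width 1 straight into a bitmask.
fn decode_to_mask(data: &[u8], num_values: usize) -> BitMask {
    let mut mask = BitMask::default();
    let mut offset = 0;
    while mask.len < num_values && offset < data.len() {
        let header = read_uleb128(data, &mut offset);
        if header & 1 == 1 {
            // Bit-packed run: `header >> 1` groups of 8 values, one byte each.
            let num_bytes = (header >> 1) as usize;
            mask.append_packed(&data[offset..offset + num_bytes]);
            offset += num_bytes;
        } else {
            // RLE run: `header >> 1` repeats of the value in the next byte.
            let bit = (data[offset] & 1) == 1;
            offset += 1;
            mask.append_run(bit, (header >> 1) as usize);
        }
    }
    mask.truncate(num_values);
    mask
}
```

For example, the bytes `[3, 0xAA, 32, 0x01]` describe one packed group followed by an RLE run of sixteen 1 bits. Because the mask is byte-aligned throughout, both runs are appended with whole-byte operations and no intermediate `[i16]` buffer is built.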
