This repository has been archived by the owner on Feb 18, 2024. It is now read-only.
Added support to read structArray from parquet (#547)
* Added parquet StructArray
* Added support for nested struct.
* Updated examples.
1 parent e5981ea, commit 9d4107c
Showing 18 changed files with 694 additions and 465 deletions.
@@ -0,0 +1,34 @@
## Observations

### LSB equivalence between definition levels and bitmaps

When the maximum repetition level is 0 and the maximum definition level is 1,
each definition level is a single bit, stored least-significant bit first.
The bit-packed runs of the RLE-encoded definition levels therefore correspond exactly
to Arrow's validity bitmap and can be copied byte for byte (memcpy) without further transformation.
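
The equivalence can be checked with a minimal sketch. The helpers `pack_levels_lsb` and `validity_bitmap` below are hypothetical names used only for illustration, not functions from this crate, and the sketch assumes the levels are already available as plain integers (it does not parse the RLE/bit-packed run headers).

```rust
/// Bit-packs definition levels of bit width 1, least-significant bit first.
/// Hypothetical helper for illustration; not part of the crate.
fn pack_levels_lsb(levels: &[u8]) -> Vec<u8> {
    let mut bytes = vec![0u8; (levels.len() + 7) / 8];
    for (i, &level) in levels.iter().enumerate() {
        if level == 1 {
            // Level `i` lands in bit `i % 8` of byte `i / 8`.
            bytes[i / 8] |= 1u8 << (i % 8);
        }
    }
    bytes
}

/// Builds an Arrow-style validity bitmap: bit `i` is set iff slot `i` is valid.
/// Hypothetical helper for illustration; not part of the crate.
fn validity_bitmap<T>(values: &[Option<T>]) -> Vec<u8> {
    let mut bytes = vec![0u8; (values.len() + 7) / 8];
    for (i, value) in values.iter().enumerate() {
        if value.is_some() {
            bytes[i / 8] |= 1u8 << (i % 8);
        }
    }
    bytes
}
```

For an optional flat column, the definition level of a slot is 1 when the value is present and 0 when it is null, so both helpers set exactly the same bits.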

## Nested parquet groups are deserialized recursively

Reading a nested parquet field is done by reading each primitive
column sequentially and building the nested struct recursively.

Rows of nested parquet groups are encoded in the repetition and definition levels.
In Arrow, these correspond to:
* a list's offsets and validity
* a struct's validity
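
As an illustration of this correspondence, the sketch below derives list offsets and validity from levels. It assumes an optional list of required items (maximum repetition level 1, maximum definition level 2, where definition level 0 means a null list, 1 an empty list and 2 a present item). The function names are hypothetical and this is not the crate's implementation.

```rust
/// Derives Arrow-style list offsets and validity from parquet levels.
/// Assumes an optional list of required items: max repetition level 1,
/// max definition level 2 (0 = null list, 1 = empty list, 2 = item present).
/// Hypothetical helper for illustration; not part of the crate.
fn list_offsets_and_validity(rep: &[u32], def: &[u32]) -> (Vec<i32>, Vec<bool>) {
    let mut offsets = vec![0i32];
    let mut validity = Vec::new();
    for (&r, &d) in rep.iter().zip(def.iter()) {
        if r == 0 {
            // Repetition level 0 starts a new row (a new list slot).
            validity.push(d >= 1);
            let last = *offsets.last().unwrap();
            offsets.push(last);
        }
        if d == 2 {
            // An item is present: the current list grows by one.
            *offsets.last_mut().unwrap() += 1;
        }
    }
    (offsets, validity)
}

/// Derives a struct's validity: the struct is valid whenever the definition
/// level reaches the level at which the struct itself is defined.
fn struct_validity(def: &[u32], struct_level: u32) -> Vec<bool> {
    def.iter().map(|&d| d >= struct_level).collect()
}
```

For example, the rows `[1, 2]`, null, `[]` and `[3]` are encoded with repetition levels `[0, 1, 0, 0, 0]` and definition levels `[2, 2, 0, 1, 2]`.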

The implementation in this module leverages this observation:

Nested parquet fields are first traversed recursively to record
whether each type is a Struct or a List, and whether it is required or optional. This information is stored
in `nested_info: Vec<Box<dyn Nested>>`. `Nested` is a trait whose implementations receive definition
and repetition levels according to the type and nullability of the nested item.
The definition and repetition levels are then processed into `nested_info`.

When we finish a field, we recursively pop from `nested_info` as we build
the `StructArray` or `ListArray`.
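
The push-then-pop flow can be sketched as follows. Everything here is a simplified stand-in: `NestedSketch`, `OptionalStruct`, `Array` and `build` are hypothetical names, the real `Nested` trait and array types differ, and only optional structs (no lists) are modelled.

```rust
/// Simplified stand-in for Arrow arrays (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
enum Array {
    Int32(Vec<Option<i32>>),
    Struct { validity: Vec<bool>, child: Box<Array> },
}

/// Hypothetical stand-in for the module's `Nested` trait.
trait NestedSketch {
    /// Receives one (repetition, definition) level pair.
    fn push(&mut self, rep: u32, def: u32);
    /// Wraps the already-built inner array in this nesting level.
    fn wrap(self: Box<Self>, inner: Array) -> Array;
}

/// An optional struct that is valid once the definition level reaches `level`.
struct OptionalStruct {
    level: u32,
    validity: Vec<bool>,
}

impl OptionalStruct {
    fn new(level: u32) -> Self {
        Self { level, validity: Vec::new() }
    }
}

impl NestedSketch for OptionalStruct {
    fn push(&mut self, _rep: u32, def: u32) {
        self.validity.push(def >= self.level);
    }

    fn wrap(self: Box<Self>, inner: Array) -> Array {
        let this = *self;
        Array::Struct { validity: this.validity, child: Box::new(inner) }
    }
}

/// Pops nesting levels innermost-first, wrapping the array built so far.
fn finish(nested_info: &mut Vec<Box<dyn NestedSketch>>, inner: Array) -> Array {
    match nested_info.pop() {
        Some(nested) => {
            let wrapped = nested.wrap(inner);
            finish(nested_info, wrapped)
        }
        None => inner,
    }
}

/// Feeds every level pair to every nesting level, then assembles the result.
fn build(
    mut nested_info: Vec<Box<dyn NestedSketch>>,
    levels: &[(u32, u32)],
    leaf: Array,
) -> Array {
    for &(rep, def) in levels {
        for nested in nested_info.iter_mut() {
            nested.push(rep, def);
        }
    }
    finish(&mut nested_info, leaf)
}
```

With a schema shaped like `optional a { optional b { optional int32 c } }`, `nested_info` would hold `OptionalStruct::new(1)` for `a` and `OptionalStruct::new(2)` for `b`. Popping yields `b` first, so the leaf is wrapped by `b` and then by `a`.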

With this approach, the only differences from the flat (non-nested) case are:
1. we do not leverage the bitmap optimization, and instead need to deserialize the repetition
   and definition levels to `i32`.
2. we deserialize definition levels twice: once to extend the values/nullability and
   once to extend `nested_info`.
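
The two passes over the definition levels can be sketched like this. It assumes the levels have already been decoded to `i32` and that the page stores only the non-null values. The names `extend_leaf` and `extend_validity` are hypothetical and do not come from this module.

```rust
/// First pass: extends the leaf values and their nullability. A stored value
/// is consumed only when the definition level equals the maximum level.
/// Hypothetical helper for illustration; not part of the crate.
fn extend_leaf(
    def_levels: &[i32],
    max_def: i32,
    values: &mut impl Iterator<Item = i32>,
) -> Vec<Option<i32>> {
    def_levels
        .iter()
        .map(|&def| if def == max_def { values.next() } else { None })
        .collect()
}

/// Second pass: extends the validity of one nesting level from the same
/// decoded definition levels.
fn extend_validity(def_levels: &[i32], level: i32) -> Vec<bool> {
    def_levels.iter().map(|&def| def >= level).collect()
}
```

Both functions read the same decoded levels, which is the extra work compared with the flat case, where the validity bits can be copied directly.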