-
Notifications
You must be signed in to change notification settings - Fork 867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for nested list arrays from parquet to arrow arrays (#993) #1588
Conversation
let key_reader = { | ||
let mut key_context = new_context.clone(); | ||
key_context.def_level += 1; | ||
key_context.path.append(vec![map_key.name().to_string()]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a drive-by fix, the context is the context of the parent, not the value being dispatched. This would result in map_key appearing twice
|
||
// If the list can contain nulls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend reading this with whitespace disabled https://github.com/apache/arrow-rs/pull/1588/files?w=1
) | ||
} | ||
|
||
fn test_nested_list<OffsetSize: OffsetSizeTrait>() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this wins the prize of the most 🤯 test I've ever written
new_context.def_level += 1; | ||
new_context.rep_level += 1; | ||
false | ||
return Err(ArrowError(format!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just moves the error to earlier, there is no point continuing if this is not supported anyway 😁
ArrowType::List(_) | ||
| ArrowType::FixedSizeList(_, _) | ||
| ArrowType::Dictionary(_, _) => Err(ArrowError(format!( | ||
"reading List({:?}) into arrow not supported yet", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the error that would previously be returned on a nested list
Codecov Report
@@ Coverage Diff @@
## master #1588 +/- ##
==========================================
+ Coverage 83.02% 83.16% +0.13%
==========================================
Files 193 193
Lines 55612 56005 +393
==========================================
+ Hits 46174 46574 +400
+ Misses 9438 9431 -7
Continue to review full report at Codecov.
|
let mut skipped = 0; | ||
|
||
// Builder used to construct the filtered child data, skipping empty lists and nulls | ||
let mut child_data_builder = MutableArrayData::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched to using this over the take kernel as I think it makes it easier to follow what is going on, it should also be significantly faster where nulls and empty lists are rare
I've confirmed this can read nested_schema.parquet from @andrei-ionescu on apache/datafusion#1383. I've also confirmed this can read https://github.com/Igosuki/arrow2/blob/main/part-00000-b4749aa1-94e4-4ddb-bab2-954c4d3a290f.c000.snappy.parquet and https://github.com/chauhanVritul/sampleparquet/blob/main/part-00000-8e5acb24-eb4e-491c-8c85-88799f25d1f0-c000.snappy.parquet provided by @Igosuki on #720 I've done a manual inspection that the output is the same as duckdb, and it looks good |
Thank you @tustvold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't say I am a deep expert in this area or this part of the codebase, but I did give the PR a careful "software engineering" review, with careful study of the tests. LGTM 👍
FWIW whitespace blind diff https://github.com/apache/arrow-rs/pull/1588/files?w=1 helped in my review
/// item type. | ||
/// | ||
/// To fully understand this algorithm, please refer to | ||
/// [parquet doc](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md). | ||
/// | ||
/// For example, a standard list type looks like: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice document addition 👍
.build() | ||
.unwrap(); | ||
|
||
let offsets = to_offsets::<OffsetSize>(vec![0, 4, 4, 4, 5]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be valuable to have a sublist that has more than 1 item?
It appears that this test only has null
, []
or a single element [x]
list. I may be misunderstanding the intent of the test or what is possible in nested lists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a [1, null]
, can probably add more though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 05c2311
true, | ||
); | ||
|
||
let actual = l1.next_batch(1024).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it also make sense to test with l1.next_batch(2)
(aka something that doesn't decode the entire array in one go?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 05c2311, required adding functionality to InMemoryArrayReader to actually respect it
Co-authored-by: Andrew Lamb <[email protected]>
@@ -1532,15 +1532,15 @@ mod tests { | |||
ArrowType::Int32, | |||
array_1.clone(), | |||
Some(vec![0, 1, 2, 3, 1]), | |||
Some(vec![1, 1, 1, 1, 1]), | |||
Some(vec![0, 1, 1, 1, 1]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive by fix, this set of levels is impossible, as the first rep level must be 0. Nothing cares as this test doesn't actually decode the implied list, but 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked at the changes, especially on ListArrayReader
. I'm not Parquet expert too, but the change looks reasonable when looking at with the added comments. 👍
/// The definition level at which this list is not null | ||
def_level: i16, | ||
/// The repetition level that corresponds to a new value in this array | ||
rep_level: i16, | ||
/// If this list is nullable | ||
nullable: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment looks clear. 👍
/// item type. | ||
/// | ||
/// To fully understand this algorithm, please refer to | ||
/// [parquet doc](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md). | ||
/// | ||
/// For example, a standard list type looks like: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice document addition 👍
Which issue does this PR close?
Closes #993.
Closes #720
Rationale for this change
Adds support for reading nested lists from parquet.
What changes are included in this PR?
Reworks the ListArrayReader to handle nested repetition levels.
Some drive by cleanup of ArrayReaderBuilder to error on encountering non-spec-compliant data
Are there any user-facing changes?
No