
Fix calculation of nested rep levels #1355

Closed

Conversation

@AnIrishDuck
Contributor

AnIrishDuck commented Jan 10, 2023

This one was fun. It ports back changes from WallarooLabs#7, with tests we can share upstream. I see there was already a fix in this area, but it is not sufficient for our use case. I'm not sure whether it clashes with our fix; with both fixes applied, our internal tests pass along with the new unit test.

Parquet is a complex format, and I am not super familiar with this area of code. The solution here feels directionally correct, but incomplete. I have several concerns, which I elaborate on below.

First: the problem. A list array within a struct array was yielding incorrect and odd results.

I traced the root cause back to the rep level encoder in the writer: it was mistakenly adding rep levels for required values. A required value has only one valid repetition count, and Parquet, being a space-conscious format, omits repetition levels when they are known from the schema. The parquet2 library therefore calculates the maximum rep level width from the parquet types alone, and that width did not match the width of the levels the writer actually emitted.

So, if a struct contained a required inner array, the writer treated values in that array as level-2 values instead of the appropriate level 1. The encoding width emitted by the rep level encoder in arrow2 thus did not match what parquet2 expects, and the rep levels written out were essentially garbage: reading the file back produced the correct underlying values, but with apparently random breaks between the nested lists.

The fix is to omit rep levels for any required fields in the nesting stack. The resulting rep level width then matches what parquet2 expects, and we get the correct nested values.
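
To make that invariant concrete, here is a minimal sketch, using hypothetical types rather than arrow2's actual nesting representation, of why only repeated (list) levels contribute to the maximum rep level that parquet2 derives from the schema:

```rust
// Hypothetical, simplified model of a nesting stack; not arrow2's real types.
#[allow(dead_code)]
enum Nesting {
    Required, // non-nullable struct level
    Optional, // nullable struct level
    List,     // repeated (list) level
}

// Only repeated levels widen the maximum repetition level; required struct
// levels must not add rep levels, or the writer's width diverges from the
// width parquet2 computes from the parquet types.
fn max_rep_level(stack: &[Nesting]) -> u32 {
    stack.iter().filter(|l| matches!(l, Nesting::List)).count() as u32
}

fn main() {
    // A required struct wrapping a list: max rep level is 1, not 2.
    let stack = [Nesting::Required, Nesting::List];
    assert_eq!(max_rep_level(&stack), 1);
}
```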

There are a few further concerns:

  • The fix is complicated by the usage of num_values from rep.rs throughout the write code. This feels like an overloaded concept, as there are many cases (required values) where values are present but writing out a rep level is unnecessary.
  • There are probably several issues with this code when you start mixing nested null and empty values. As far as I can tell, the Dremel standard provides no way to distinguish between a null list and an empty list. I think this makes arbitrary round-trip conversions between arrow and parquet impossible? I can't find any examples of Dremel encoding that preserves the difference between the two, but my Google searches may be missing something.
  • I'm somewhat worried that this fix is incomplete. This is the second persistence bug we've discovered while using this library to read/write parquet. Would there be any concerns if we were to develop and donate property tests to the project? I'm thinking of using quickcheck to check that a random sample of arbitrarily nested arrow arrays round-trip properly to and from the parquet format, possibly modulo the concerns around null / empty lists discussed above.

@tustvold
Contributor

As far as I can tell, the Dremel standard provides no way to distinguish between a null list and an empty list.

FWIW this is not correct: a repeated field is always required, and its corresponding definition level is what indicates an empty list. You can then nest this within a nullable group to represent a null list. This is documented here.

You may find this series I wrote on the topic helpful. You may also be able to crib from the parquet crate's implementation of the record shredding, schema inference, and record delimiting necessary to make this use case work properly. I will forewarn you that this rabbit hole goes very deep 😅

@AnIrishDuck
Contributor Author

Ah, of course. Thanks for the correction; that didn't seem right, but I couldn't figure out how to make it work. If I'm reading this correctly, you basically need two levels of nesting in a parquet schema for every (nullable) list in arrow. If the outer level is not defined, you have a null array. If the inner level is not defined, you have an empty array.
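
For concreteness, a worked example of that two-level reading, sketched against an assumed standard three-level LIST schema rather than arrow2's actual writer output:

```rust
// Assumed schema shape (standard three-level LIST encoding):
//   optional group col (LIST) { repeated group list { optional int32 element } }
// Definition levels: 0 = null list, 1 = empty list,
//                    2 = null element, 3 = present element.
fn main() {
    // Rows: [1, 2], null, [], [null]
    let def_levels: [i16; 5] = [3, 3, 0, 1, 2];
    let rep_levels: [i16; 5] = [0, 1, 0, 0, 0]; // rep 1 = continue the current list
    let values: [i32; 2] = [1, 2]; // only fully-defined elements store a value

    assert_eq!(def_levels.len(), rep_levels.len());
    assert_eq!(values.len(), def_levels.iter().filter(|&&d| d == 3).count());
}
```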

Thanks for the linked resources as well. They're exactly what I was looking for.

@jorgecarleitao
Owner

jorgecarleitao commented Feb 10, 2023

@AnIrishDuck thanks a lot for the PR!

Unfortunately, as you wrote, I think this was not sufficient. I think this was addressed in #1390.

I agree that dremel is a pain ^^

@AnIrishDuck
Contributor Author

Yeah, I was always suspicious that my fix was incomplete, but it was the best I could come up with at the time.

Glad to see it addressed in a deeper and more comprehensive way.

Seems like I should update this to just include the previously failing test and validate that it works after #1390?

@AnIrishDuck
Contributor Author

Also, if we were to do the work, would some property testing of the parquet functionality in this crate be accepted? I'm thinking something that generates arbitrary arrow arrays, then verifies they round-trip through parquet back to the same array?

@jorgecarleitao
Owner

Also, if we were to do the work, would some property testing of the parquet functionality in this crate be accepted? I'm thinking something that generates arbitrary arrow arrays, then verifies they round-trip through parquet back to the same array?

@AnIrishDuck That would be awesome! Not sure if you are familiar with proptest, but one idea here is to implement Arbitrary for the array types and use proptest to generate them.
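
For reference, a minimal sketch of what such a property test could look like with proptest. The names `arbitrary_rows` and `roundtrip_through_parquet` are hypothetical, the nested `Vec<Option<...>>` rows stand in for a nullable list array, and the actual parquet round-trip against arrow2's io::parquet module is left as a stub:

```rust
use proptest::prelude::*;

type Rows = Vec<Option<Vec<Option<i32>>>>;

// Stand-in for the project-specific helper: in a real test this would build a
// list array from `rows`, write it to an in-memory parquet file, read it back,
// and decode the rows again.
fn roundtrip_through_parquet(rows: &Rows) -> Rows {
    let _ = rows;
    unimplemented!("write the rows to parquet and read them back")
}

// Nullable lists of nullable i32 values, of varying lengths (including empty).
fn arbitrary_rows() -> impl Strategy<Value = Rows> {
    prop::collection::vec(
        prop::option::of(prop::collection::vec(prop::option::of(any::<i32>()), 0..8)),
        0..32,
    )
}

proptest! {
    #[test]
    fn nested_lists_roundtrip(rows in arbitrary_rows()) {
        prop_assert_eq!(roundtrip_through_parquet(&rows), rows);
    }
}
```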

@AnIrishDuck
Contributor Author

Yup, that's exactly what I was thinking. No promises, but we hopefully will have some time this quarter to look into this.

@AnIrishDuck
Contributor Author

AnIrishDuck commented May 31, 2023

@jorgecarleitao - I finally have carved out some time to look into this. Two organizational questions:

  • Should I drop this PR and create a new one? Or just repurpose this one?
  • The arbitrary array generation code seems broadly useful? Should it live deep in the parquet tests or somewhere else in tests/it (perhaps tests/it/array)?

@AnIrishDuck
Contributor Author

Also, I gravitate towards quickcheck instead of proptest as it's less macro-heavy. Any objections, other considerations, or other alternatives I should explore?

@AnIrishDuck
Contributor Author

See #1519
