
Fix calculation of nested rep levels #1355

Closed

Conversation

@AnIrishDuck
Contributor

AnIrishDuck commented Jan 10, 2023

This one was fun. It ports back changes from WallarooLabs#7, with tests we can share upstream. I see there was already a fix in this area, but it is not sufficient for our use case. I'm not sure whether it clashes with our fix; with both fixes applied, our internal tests pass along with the new unit test.

Parquet is a complex format, and I am not super familiar with this area of code. The solution here feels directionally correct, but incomplete. I have several concerns, which I elaborate on below.

First: the problem. A list array within a struct array was yielding incorrect and odd results.

I traced the root cause back to the rep level encoder in the writer: it was mistakenly adding rep levels for required values. A required value has only one valid repetition count, and Parquet, being a space-conscious format, omits repetition levels when they are known from the schema. The parquet2 library therefore calculates the maximum rep level width from the parquet types alone, and that width did not match the width of the levels the writer actually emitted.

So, if a struct contained a required inner array, the writer treated values in that array as level-2 values instead of the appropriate level 1. The encoding width emitted by the rep level encoder in arrow2 thus did not match what parquet2 expects, and the rep levels written out were essentially garbage: reading the file back produced the correct underlying values, but with apparently random breaks between the nested lists.

The fix is to omit rep levels for any required fields in the nesting stack. The resulting rep level width then matches what parquet2 expects, and we get the correct nested values.
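
To make that invariant concrete, here is a minimal sketch, using hypothetical types rather than arrow2's actual nesting representation, of why only repeated (list) levels contribute to the maximum rep level that parquet2 derives from the schema:

```rust
// Hypothetical, simplified model of a nesting stack; not arrow2's real types.
#[allow(dead_code)]
enum Nesting {
    Required, // non-nullable struct level
    Optional, // nullable struct level
    List,     // repeated (list) level
}

// Only repeated levels widen the maximum repetition level; required struct
// levels must not add rep levels, or the writer's width diverges from the
// width parquet2 computes from the parquet types.
fn max_rep_level(stack: &[Nesting]) -> u32 {
    stack.iter().filter(|l| matches!(l, Nesting::List)).count() as u32
}

fn main() {
    // A required struct wrapping a list: max rep level is 1, not 2.
    let stack = [Nesting::Required, Nesting::List];
    assert_eq!(max_rep_level(&stack), 1);
}
```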

There are a few further concerns:

  • The fix is complicated by the usage of num_values from rep.rs throughout the write code. This feels like an overloaded concept, as there are many cases (required values) where values are present but writing out a rep level is unnecessary.
  • There are probably several issues with this code when you start mixing nested null and empty values. As far as I can tell, the Dremel standard provides no way to distinguish between a null list and an empty list. I think this makes arbitrary round-trip conversions between arrow and parquet impossible? I can't find any examples of Dremel encoding that preserves the difference between the two, but my Google searches may be missing something.
  • I'm somewhat worried that this fix is incomplete. This is the second persistence bug we've discovered while using this library to read/write parquet. Would there be any concerns if we were to develop and donate property tests to the project? I'm thinking of using quickcheck to check that a random sample of arbitrarily nested arrow arrays round-trip properly to and from the parquet format, possibly modulo the concerns around null / empty lists discussed above.

@tustvold
Contributor

As far as I can tell, the Dremel standard provides no way to distinguish between a null list and an empty list.

FWIW this is not correct: a repeated field is always required, and its corresponding definition level is what indicates an empty list. You can then nest this within a nullable group to represent a null list. This is documented here.

You may find this series I wrote on the topic helpful. You may also be able to crib from the parquet crate's implementation of the record shredding, schema inference, and record delimiting necessary to make this use case work properly. I will forewarn you that this rabbit hole goes very deep 😅

@AnIrishDuck
Contributor Author

Ah, of course. Thanks for the correction; that didn't seem right, but I couldn't figure out how to make it work. If I'm reading this correctly, you basically need two levels of nesting in a parquet schema for every (nullable) list in arrow. If the outer level is not defined, you have a null array. If the inner level is not defined, you have an empty array.
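
For concreteness, a worked example of that two-level reading, sketched against an assumed standard three-level LIST schema rather than arrow2's actual writer output:

```rust
// Assumed schema shape (standard three-level LIST encoding):
//   optional group col (LIST) { repeated group list { optional int32 element } }
// Definition levels: 0 = null list, 1 = empty list,
//                    2 = null element, 3 = present element.
fn main() {
    // Rows: [1, 2], null, [], [null]
    let def_levels: [i16; 5] = [3, 3, 0, 1, 2];
    let rep_levels: [i16; 5] = [0, 1, 0, 0, 0]; // rep 1 = continue the current list
    let values: [i32; 2] = [1, 2]; // only fully-defined elements store a value

    assert_eq!(def_levels.len(), rep_levels.len());
    assert_eq!(values.len(), def_levels.iter().filter(|&&d| d == 3).count());
}
```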

Thanks for the linked resources as well. They're exactly what I was looking for.

@jorgecarleitao
Owner

jorgecarleitao commented Feb 10, 2023

@AnIrishDuck thanks a lot for the PR!

Unfortunately, as you wrote, I think this was not sufficient. I think this was addressed in #1390.

I agree that dremel is a pain ^^

@AnIrishDuck
Contributor Author

Yeah, I was always suspicious that my fix was incomplete, but it was the best I could come up with at the time.

Glad to see it addressed in a deeper and more comprehensive way.

Seems like I should update this to just include the previously failing test and validate that it works after #1390?

@AnIrishDuck
Contributor Author

Also, if we were to do the work, would some property testing of the parquet functionality in this crate be accepted? I'm thinking something that generates arbitrary arrow arrays, then verifies they round-trip through parquet back to the same array?

@jorgecarleitao
Owner

Also, if we were to do the work, would some property testing of the parquet functionality in this crate be accepted? I'm thinking something that generates arbitrary arrow arrays, then verifies they round-trip through parquet back to the same array?

@AnIrishDuck That would be awesome! Not sure if you are familiar with proptest, but one idea here is to implement Arbitrary for the array types and use proptest to generate them.
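
For reference, a minimal sketch of what such a property test could look like with proptest. The names `arbitrary_rows` and `roundtrip_through_parquet` are hypothetical, the nested `Vec<Option<...>>` rows stand in for a nullable list array, and the actual parquet round-trip against arrow2's io::parquet module is left as a stub:

```rust
use proptest::prelude::*;

type Rows = Vec<Option<Vec<Option<i32>>>>;

// Stand-in for the project-specific helper: in a real test this would build a
// list array from `rows`, write it to an in-memory parquet file, read it back,
// and decode the rows again.
fn roundtrip_through_parquet(rows: &Rows) -> Rows {
    let _ = rows;
    unimplemented!("write the rows to parquet and read them back")
}

// Nullable lists of nullable i32 values, of varying lengths (including empty).
fn arbitrary_rows() -> impl Strategy<Value = Rows> {
    prop::collection::vec(
        prop::option::of(prop::collection::vec(prop::option::of(any::<i32>()), 0..8)),
        0..32,
    )
}

proptest! {
    #[test]
    fn nested_lists_roundtrip(rows in arbitrary_rows()) {
        prop_assert_eq!(roundtrip_through_parquet(&rows), rows);
    }
}
```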

@AnIrishDuck
Contributor Author

Yup, that's exactly what I was thinking. No promises, but we hopefully will have some time this quarter to look into this.

@AnIrishDuck
Contributor Author

AnIrishDuck commented May 31, 2023

@jorgecarleitao - I finally have carved out some time to look into this. Two organizational questions:

  • Should I drop this PR and create a new one? Or just repurpose this one?
  • The arbitrary array generation code seems broadly useful? Should it live deep in the parquet tests or somewhere else in tests/it (perhaps tests/it/array)?

@AnIrishDuck
Contributor Author

Also, I gravitate towards quickcheck instead of proptest as it's less macro-heavy. Any objections, other considerations, or other alternatives I should explore?

@AnIrishDuck
Contributor Author

See #1519
