Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix calculation of nested rep levels #7

Merged
merged 1 commit into from
Dec 21, 2022

Conversation

AnIrishDuck
Copy link
Collaborator

@AnIrishDuck AnIrishDuck commented Dec 20, 2022

This one was fun. To understand what's going on here, you'll need to understand the Dremel format used by parquet. Twitter's engineering blog has a good primer.

The Dremel paper has even more detailed examples.

The rep level encoder was mistakenly adding rep levels for required values. A required value only has one valid repetition count: 1. Parquet is a space-conscious format and therefore just omits repetition counts when they are known.

So, if a struct required an inner array, the writer would treat values in that array as level-2 values instead of the appropriate level-1 value. This would lead to a mismatch. The parquet2 library would calculate the max rep encoding length from parquet types and get one value. The encodings emitted by the rep level encoder in arrow2 would have a different, higher max level. The end result would basically just emit garbage when writing out rep levels.

The fix is to omit rep levels for any required fields in the nesting stack.

The fix is further complicated by the usage of num_values from rep levels throughout the write code. This feels like a hack, as there are many cases where values are present but a rep level tape out is unnecessary. Honestly, there are probably several issues with this code when you start mixing null and empty values. As far as I can tell, the Dremel standard provides no way to distinguish between a null list and an empty list. This makes arbitrary round-trip conversions between arrow and parquet impossible.

Copy link

@JONBRWN JONBRWN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had to do some reading before reviewing, looks good overall. let's get this in so we can fix the issues we're seeing and then figure out how to get support to move this upstream

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants