Fix calculation of nested rep levels #7
Merged
This one was fun. To understand what's going on here, you'll need to understand the Dremel format used by Parquet. Twitter's engineering blog has a good primer, and the Dremel paper has even more detailed examples.
The rep level encoder was mistakenly adding rep levels for required values. A required value has exactly one valid repetition count: 1. Parquet is a space-conscious format, so it simply omits repetition counts when they are already known.
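As a rough illustration of that rule (a sketch of the Dremel level arithmetic, not the arrow2 or parquet2 code; `Repetition`, `max_rep_level`, and `max_def_level` are hypothetical names), the maximum levels for a leaf can be derived from the repetition kind of each field on the path from the root to the leaf:

```rust
/// Repetition kind of one field on the path from the schema root to a leaf.
/// Hypothetical type for illustration; not the parquet2 schema type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Maximum repetition level of a leaf: only repeated fields contribute.
/// Required (and optional) fields occur at most once, so they add nothing.
fn max_rep_level(path: &[Repetition]) -> u32 {
    path.iter()
        .filter(|r| **r == Repetition::Repeated)
        .count() as u32
}

/// Maximum definition level of a leaf: every field that may be absent
/// (optional or repeated) contributes; required fields do not.
fn max_def_level(path: &[Repetition]) -> u32 {
    path.iter()
        .filter(|r| **r != Repetition::Required)
        .count() as u32
}
```

For a path such as required struct, required list group, repeated entries, optional element, this gives a maximum rep level of 1 and a maximum def level of 2: the two required fields add nothing to either count.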
So, if a struct required an inner array, the writer would treat values in that array as level-2 values instead of the appropriate level-1 values. This would lead to a mismatch. The `parquet2` library would calculate the max rep encoding length from the Parquet types and get one value. The encodings emitted by the rep level encoder in arrow2 would have a different, higher max level. The end result was that the writer emitted garbage when writing out rep levels.

The fix is to omit rep levels for any required fields in the nesting stack.
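To see why a higher-than-expected level turns into garbage, recall that Parquet stores levels using a bit width derived from the maximum level. The sketch below uses hypothetical helpers (`level_bit_width` and `levels_fit` are illustrative names, not parquet2 functions) to check whether a sequence of levels fits in the width implied by the schema's max level:

```rust
/// Number of bits needed to store any level in `0..=max_level`.
/// Hypothetical helper; mirrors the usual ceil(log2(max_level + 1)) rule.
fn level_bit_width(max_level: u32) -> u32 {
    u32::BITS - max_level.leading_zeros()
}

/// Returns true if every level can be represented in the bit width
/// implied by `max_level`. A level that does not fit is truncated or
/// misread once the levels are bit-packed.
fn levels_fit(levels: &[u32], max_level: u32) -> bool {
    let width = level_bit_width(max_level);
    // Use u64 so the shift cannot overflow when width == 32.
    levels.iter().all(|&l| (l as u64) < (1u64 << width))
}
```

With a schema-derived max rep level of 1, the width is a single bit, so levels `[0, 1, 1]` fit but `[0, 2, 1]` do not. That is the shape of the mismatch described above: one side sizes the encoding for level 1 while the other emits level 2.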
The fix is further complicated by the usage of `num_values` from rep levels throughout the write code. This feels like a hack, as there are many cases where values are present but writing out rep levels is unnecessary. Honestly, there are probably several issues with this code once you start mixing null and empty values. As far as I can tell, the Dremel standard provides no way to distinguish between a null list and an empty list. This makes arbitrary round-trip conversions between Arrow and Parquet impossible.