Allow for reading improperly encoded UINT_8 and UINT_16 Parquet data #7055
Which issue does this PR close?
Closes #7040.
Rationale for this change
This is a potential fix for the case where a Parquet writer improperly writes non-zero upper bits for the UINT_8 and UINT_16 logical types. For example, `238u8` becomes `0xffffffee` when written to the Parquet file. In this case, the array cast from the Parquet type to the Arrow type currently fails.
What changes are included in this PR?
Modifies `PrimitiveArrayReader` to explicitly handle conversion of the Parquet physical type INT32 to Arrow UInt8 or UInt16.
Leaving this as a draft for now until some consensus can be reached in the community on how this type of malformed data should be handled. The Parquet spec currently states that reader behavior in this case is undefined, so the current behavior of parquet-rs is perfectly fine. Also, testing this change will likely involve adding an improperly encoded file to parquet-testing.
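To make the `238u8` example concrete, here is a minimal standalone sketch of the bit-level behavior, using only the Rust standard library. The helper names (`strict_u8`, `lenient_u8`, `lenient_u16`) are hypothetical and are not part of parquet-rs or arrow-rs. The lenient variant shows one plausible way to handle such data (keeping only the low-order bits); it is not necessarily what this PR implements.

```rust
/// Hypothetical helper: a range-checked conversion, which rejects any
/// INT32 value outside 0..=255.
fn strict_u8(v: i32) -> Option<u8> {
    u8::try_from(v).ok()
}

/// Hypothetical helper: keep only the low 8 bits and ignore whatever the
/// writer put in the upper 24 bits.
fn lenient_u8(v: i32) -> u8 {
    v as u8
}

/// Hypothetical helper: same idea for UINT_16, keeping the low 16 bits.
fn lenient_u16(v: i32) -> u16 {
    v as u16
}

fn main() {
    // The INT32 bit pattern from the example above: 0xffffffee.
    let stored = 0xffff_ffee_u32 as i32;

    // Interpreted as a signed 32-bit integer the value is negative,
    // so a range-checked conversion to u8 has nothing valid to return.
    println!("stored as i32: {stored}");
    println!("strict:  {:?}", strict_u8(stored));

    // Truncating to the low 8 bits recovers the intended value.
    println!("lenient: {}", lenient_u8(stored));

    // A 16-bit analogue (assumed for illustration): 65000u16 is 0xfde8,
    // shown here with its upper 16 bits set.
    let stored16 = 0xffff_fde8_u32 as i32;
    println!("lenient u16: {}", lenient_u16(stored16));
}
```

Truncation is lossless when the low-order bits hold the intended value, as in the example above, but it would also silently accept values that are corrupt for other reasons. That trade-off is part of what needs consensus.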
Are there any user-facing changes?