Allow for reading improperly encoded UINT_8 and UINT_16 Parquet data #7055

Draft
wants to merge 1 commit into
base: main

etseidl
Contributor

@etseidl etseidl commented Jan 31, 2025

Which issue does this PR close?

Closes #7040.

Rationale for this change

This is a potential fix for the case where a Parquet writer improperly sets the high-order bits of values with the UINT_8 and UINT_16 logical types, which are stored as physical type INT32. For example, 238u8 (0xee) becomes 0xffffffee, i.e. -18 as an INT32, when written to the Parquet file. In this case, the array cast from the Parquet type to the Arrow type currently fails.
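To make the failure mode concrete, here is a minimal sketch in plain Rust. It does not use the parquet or arrow crates; `sign_extend_u8` is a hypothetical model of what a misbehaving writer might do, and `checked_to_u8` is only a stand-in for a range-checked cast, not the actual Arrow cast kernel.

```rust
/// Hypothetical model of a buggy writer: reinterpret the u8 as i8, then
/// sign-extend to the 32-bit physical value stored in the file.
fn sign_extend_u8(value: u8) -> i32 {
    value as i8 as i32
}

/// Stand-in for a range-checked cast (an assumption, not the Arrow kernel):
/// returns None when the stored INT32 is outside 0..=255.
fn checked_to_u8(stored: i32) -> Option<u8> {
    u8::try_from(stored).ok()
}
```

With these helpers, `sign_extend_u8(238)` yields -18, whose bit pattern is 0xffffffee, and `checked_to_u8(-18)` returns `None`, whereas a correctly written 238 round-trips as `Some(238)`.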

What changes are included in this PR?

Modifies PrimitiveArrayReader to explicitly handle conversion of Parquet physical type INT32 to Arrow UInt8 or UInt16.
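One way such an explicit conversion could look is a truncating cast that keeps only the low-order bits of each INT32 value. This is an illustrative sketch, not the code in this PR: the function names are hypothetical, and it operates on plain slices rather than on the buffers inside PrimitiveArrayReader.

```rust
/// Hypothetical helper: keep only the low 8 bits of each stored INT32.
/// Correctly encoded values (0..=255) are unchanged; sign-extended
/// values such as -18 (0xffffffee) map back to 238.
fn int32_to_u8_truncating(values: &[i32]) -> Vec<u8> {
    values.iter().map(|&v| v as u8).collect()
}

/// Hypothetical helper: same idea for UINT_16, keeping the low 16 bits.
fn int32_to_u16_truncating(values: &[i32]) -> Vec<u16> {
    values.iter().map(|&v| v as u16).collect()
}
```

The trade-off is that truncation silently accepts any out-of-range INT32, not only sign-extended ones, which is part of why the handling of malformed data is a judgment call.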

Leaving this as draft for now until some consensus can be reached in the community as to how this type of malformed data should be handled. The Parquet spec currently states that reader behavior in this case is undefined, so the current state of parquet-rs is perfectly fine. Also, testing this change will likely involve adding an improperly encoded file to parquet-testing.

Are there any user-facing changes?

@github-actions github-actions bot added the "parquet" (Changes to the parquet crate) label Jan 31, 2025
@etseidl etseidl changed the title from "allow for reading improperly encode UINT_8 and UINT_16 parquet data" to "Allow for reading improperly encoded UINT_8 and UINT_16 Parquet data" Jan 31, 2025
Successfully merging this pull request may close these issues.

Allow Parquet reader to read incorrectly written (negative) uint8, uint16 values for compatibility