Allow for reading improperly encoded UINT_8 and UINT_16 Parquet data #7055

etseidl · 2025-01-31T23:19:46Z

Which issue does this PR close?

Closes #7040.

Rationale for this change

This is a potential fix for the case where a Parquet writer improperly writes non-zero high-order bits for values of the UINT_8 and UINT_16 logical types, which are stored as physical type INT32. For example, 238u8 becomes 0xffffffee (the sign-extended form) when written to the Parquet file. In this case, the array cast from the Parquet type to the Arrow type currently fails.
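To make the bit pattern concrete, here is a minimal sketch of how a writer could end up with 0xffffffee, assuming it sign-extends the byte before storing it as INT32. The function name `sign_extended_write` is hypothetical and only models the faulty writer; it is not part of parquet-rs or any other writer.

```rust
/// Hypothetical model of a faulty writer: the u8 is reinterpreted as i8
/// and then sign-extended to the INT32 physical value.
fn sign_extended_write(v: u8) -> i32 {
    v as i8 as i32
}

/// A range-checked conversion back to u8, which rejects any value
/// outside 0..=255 (including every negative INT32).
fn checked_read(stored: i32) -> Option<u8> {
    u8::try_from(stored).ok()
}
```

With these definitions, `sign_extended_write(238)` yields -18, whose 32-bit pattern is 0xffffffee, and `checked_read(-18)` returns `None`, while a correctly written 238 round-trips through `checked_read` unchanged.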

What changes are included in this PR?

Modifies PrimitiveArrayReader to explicitly handle conversion of Parquet physical type INT32 to Arrow UInt8 or UInt16.
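One way such an explicit conversion could work is to keep only the low-order bits of each INT32 value. The sketch below illustrates that idea on plain slices; it is an assumption about the approach, not the code in this PR, and the function names are hypothetical.

```rust
/// Keep only the low 8 bits of each INT32 physical value.
/// Illustrative only: a truncating `as` cast discards the high-order bits.
fn int32_to_u8_truncating(values: &[i32]) -> Vec<u8> {
    values.iter().map(|&v| v as u8).collect()
}

/// Keep only the low 16 bits of each INT32 physical value.
fn int32_to_u16_truncating(values: &[i32]) -> Vec<u16> {
    values.iter().map(|&v| v as u16).collect()
}
```

Under this scheme both the properly written 238 and the sign-extended -18 (0xffffffee) decode to 238u8. The trade-off is that genuinely out-of-range values are also silently truncated rather than rejected.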

Leaving this as draft for now until some consensus can be reached in the community as to how this type of malformed data should be handled. The Parquet spec currently states that reader behavior in this case is undefined, so the current state of parquet-rs is perfectly fine. Also, testing this change will likely involve adding an improperly encoded file to parquet-testing.

Are there any user-facing changes?

allow for reading improperly encode UINT_8 and UINT_16 parquet data

97418a7

github-actions bot added the parquet Changes to the parquet crate label Jan 31, 2025

etseidl changed the title ~~allow for reading improperly encode UINT_8 and UINT_16 parquet data~~ Allow for reading improperly encoded UINT_8 and UINT_16 Parquet data Jan 31, 2025

etseidl mentioned this pull request Jan 31, 2025

Allow Parquet reader to read incorrectly written (negative) uint8, uint16 values for compatibility #7040

Open
