Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent UnionArray with Repeated Type IDs #3982

Closed
tustvold opened this issue Mar 30, 2023 · 6 comments · Fixed by #4070
Closed

Prevent UnionArray with Repeated Type IDs #3982

tustvold opened this issue Mar 30, 2023 · 6 comments · Fixed by #4070
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@tustvold
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently DataType::Union has no protection against the same type ID appearing multiple times in its DataType.

Describe the solution you'd like

We should probably validate that the type_ids are unique, and prevent constructing an invalid DataType

Describe alternatives you've considered

We could not do this, much like currently permit invalid DataType::Decimal #2362

Additional context

@tustvold tustvold added the enhancement Any new improvement worthy of a entry in the changelog label Mar 30, 2023
@Weijun-H
Copy link
Member

Could I pick this ticket up?

@tustvold
Copy link
Contributor Author

tustvold commented Apr 12, 2023

Sure, as the maximum id is 127, you can probably just use an u128 bitset to check for duplicates within UnionFields

@Weijun-H
Copy link
Member

Should we guarantee that typeid is unsigned? The current implementation does not guarantee it.

@tustvold
Copy link
Contributor Author

Should we guarantee that typeid is unsigned

The arrow format specifies the following - https://arrow.apache.org/docs/format/Columnar.html#union-layout

A buffer of 8-bit signed integers. Each type in the union has a corresponding type id whose values are found in this buffer. A union with more than 127 possible types can be modeled as a union of unions.

So they should be signed integers, but never negative 😅. Most likely this was originally a workaround for Java

@tustvold
Copy link
Contributor Author

label_issue.py automatically added labels {'arrow'} from #4070

@tustvold tustvold added the parquet Changes to the parquet crate label Apr 21, 2023
@tustvold
Copy link
Contributor Author

label_issue.py automatically added labels {'parquet'} from #3981

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants