-
Notifications
You must be signed in to change notification settings - Fork 867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable casting between Utf8/LargeUtf8 and Binary/LargeBinary #3542
Conversation
e84afb3
to
5d63695
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is sound as it doesn't perform UTF-8 validation on the strings AFAICT.
I think we should implement StringArray <-> BinaryArray and LargeStringArray <-> LargeBinaryArray using the existing From implementations, that among other things avoid re-encoding offsets and perform optimised validation.
We can then implement Binary <-> LargeBinary and StringArray <-> LargeStringArray separately, ideally sharing logic with ListArray <-> LargeListArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tustvold Sorry for late.
For the question about UTF-8 validation, this change doesn't change casting from binary/large binary to string/large string. I think this is the only scenarios we need to perform UTF-8 validation.
This change only adds:
- LargeString to Boolean, Boolean to LargeString
- Binary to LargeBinary, LargeBinary to Binary
- String to LargeBinary
- LargeString to Binary
For above cases, I think we don't need UTF-8 validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently we support Binary to String/LargeString. This patch doesn't touch them. I will extend it to support LargeBinary to String/LargeString in other work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can go in as is, I hadn't noticed that cast_byte_container
was only being used in "valid" contexts. However, I think we should either make it unsafe, or restrict its signature to be sound prior to merge.
Benchmark runs are scheduled for baseline = d938cd9 and contender = bf21ad9. bf21ad9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #1179.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?