This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Faster utf8 validation: ~1.14-2x improvement #823

Merged
merged 7 commits into from
Feb 8, 2022

Dandandan
Collaborator

@Dandandan Dandandan commented Feb 7, 2022

Two improvements for try_check_offsets_and_utf8

  • Check bounds afterwards, and only for the last offset: the monotonicity check guarantees that it is the largest value (same as try_check_offsets).
  • Run simdutf8::basic::from_utf8 once on the entire buffer instead of on individual slices, then check that every slice starts at a valid position, i.e. that its first byte is a valid byte to start a UTF-8 sequence. I am not 100% sure this is correct; it is inspired by a recent PR to arrow-rs.
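
The two points above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the function name check_offsets_and_utf8, the usize offsets, and the String error type are assumptions, and std::str::from_utf8 stands in for simdutf8::basic::from_utf8 so that the sketch has no dependencies.

```rust
/// Hypothetical stand-in for `try_check_offsets_and_utf8` (name, offset type
/// and error type are assumptions made for this sketch).
fn check_offsets_and_utf8(offsets: &[usize], values: &[u8]) -> Result<(), String> {
    // 1. Monotonicity: every offset must be >= the previous one.
    if offsets.windows(2).any(|w| w[0] > w[1]) {
        return Err("offsets are not monotonically increasing".to_string());
    }

    // 2. Bounds: after step 1 the last offset is the largest one,
    //    so it is the only offset that needs a bounds check.
    let last = match offsets.last() {
        Some(&last) => last,
        None => return Err("offsets must not be empty".to_string()),
    };
    if last > values.len() {
        return Err("last offset exceeds the length of the values".to_string());
    }

    // 3. Validate the whole buffer once instead of once per slice.
    //    (The PR uses simdutf8::basic::from_utf8; std is used here instead.)
    std::str::from_utf8(values).map_err(|e| e.to_string())?;

    // 4. Every offset must point at the start of a UTF-8 sequence,
    //    i.e. not at a continuation byte (0x80..=0xBF).
    for &start in offsets {
        if let Some(&byte) = values.get(start) {
            if (byte as i8) < -0x40 {
                return Err(format!("offset {start} is not a char boundary"));
            }
        }
    }
    Ok(())
}
```

For the two-byte encoding of pi followed by an ASCII letter, the offsets [0, 2, 3] would pass, while [0, 1, 3] would be rejected in step 4.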

The speedup is up to 2x on smaller inputs, which previously failed to use the SIMD code because it depends on larger inputs.

read utf8 large 2^14    time:   [1.2200 ms 1.2232 ms 1.2262 ms]                                  
                        change: [-13.218% -12.371% -11.592%] (p = 0.00 < 0.05)
                        Performance has improved.

read utf8 emoji 2^14    time:   [130.95 us 131.35 us 131.82 us]                                 
                        change: [-49.840% -49.615% -49.358%] (p = 0.00 < 0.05)
                        Performance has improved.

@codecov

codecov bot commented Feb 7, 2022

Codecov Report

Merging #823 (00bba47) into main (1dd1b19) will increase coverage by 0.00%.
The diff coverage is 87.50%.

@@           Coverage Diff           @@
##             main     #823   +/-   ##
=======================================
  Coverage   71.63%   71.63%           
=======================================
  Files         326      326           
  Lines       17523    17525    +2     
=======================================
+ Hits        12552    12554    +2     
  Misses       4971     4971           
Impacted Files                     Coverage Δ
src/array/specification.rs         91.89% <87.50%> (+3.32%) ⬆️
src/compute/arithmetics/time.rs    25.68% <0.00%> (-0.92%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ritchie46
Collaborator

ritchie46 commented Feb 7, 2022

> run simdutf8::basic::from_utf8 on the entire buffer instead of individual strings and validate whether the start of the utf8 sequence is correct. I am not 100% sure it is correct, this is inspired by a recent PR to arrow-rs. It checks whether the first byte is a valid byte to start the sequence.

We tried this one before, but sadly we cannot do that optimization. A buffer that is valid utf8 as a whole is not guaranteed to contain only valid utf8 elements. For instance, the letter pi is encoded as [207, 128], which is not valid utf8 if an offset splits it.

Or is this sound in combination with the is_char_boundary check? 🤔
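
The concern can be reproduced with a small sketch. The helper whole_vs_slices is hypothetical and uses std::str::from_utf8 only; it assumes the offsets are non-decreasing and within bounds, and panics otherwise.

```rust
/// Returns `(whole buffer is valid, every slice is valid)` for the given
/// offsets. Hypothetical helper for illustration only.
fn whole_vs_slices(values: &[u8], offsets: &[usize]) -> (bool, bool) {
    // Validate the buffer as one string.
    let whole = std::str::from_utf8(values).is_ok();
    // Validate each element `values[start..end]` on its own.
    let slices = offsets
        .windows(2)
        .all(|w| std::str::from_utf8(&values[w[0]..w[1]]).is_ok());
    (whole, slices)
}
```

With the bytes of pi, [207, 128], and the offsets [0, 1, 2], the buffer validates as a whole but neither one-byte element does, so validating the whole buffer is not sufficient on its own.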

@Dandandan
Collaborator Author

Yes, I am aware of this. The letter pi split by other offsets will not validate, because 128 is not a valid start of a sequence: (128u8 as i8) < -0x40 evaluates to true, so it gives an error.
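
As a rough sketch of that check applied to the pi example (the helper names is_continuation and first_bad_offset are made up for this illustration and are not part of the PR):

```rust
/// True if `b` cannot start a UTF-8 sequence because it is a continuation
/// byte (0x80..=0xBF). Reinterpreted as `i8`, those bytes are -128..=-65,
/// which is exactly the range below -0x40 (-64).
fn is_continuation(b: u8) -> bool {
    (b as i8) < -0x40
}

/// Returns the first offset that points at a continuation byte, if any.
/// An offset equal to `values.len()` is accepted (end of the buffer).
fn first_bad_offset(values: &[u8], offsets: &[usize]) -> Option<usize> {
    offsets
        .iter()
        .copied()
        .find(|&o| values.get(o).map_or(false, |&b| is_continuation(b)))
}
```

For the values [207, 128], the offsets [0, 1, 2] are rejected at offset 1, while [0, 2] are accepted.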

@ritchie46
Collaborator

Yes, I understand the rationale now. So this is sound under the assumption that every byte of a multi-byte unicode char has a value > 127, right? Really interesting observation! 🙌

@Dandandan
Collaborator Author

Dandandan commented Feb 7, 2022

Yeah, every second, third and fourth byte of a multi-byte sequence always starts with the bits "10", which is what this condition checks for in a clever way.

Also see the encoding table here
https://en.m.wikipedia.org/wiki/UTF-8
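
To make the "10" bit pattern concrete, here is a small sketch that compares the signed comparison with an explicit bit mask over all 256 byte values. The function names are made up for this illustration.

```rust
/// Explicit form: the two top bits of the byte are "10".
fn starts_with_bits_10(b: u8) -> bool {
    (b & 0b1100_0000) == 0b1000_0000
}

/// Form discussed above: reinterpret the byte as `i8` and compare.
fn signed_check(b: u8) -> bool {
    (b as i8) < -0x40
}

/// True if both forms agree for every possible byte value.
fn checks_agree() -> bool {
    (0..=u8::MAX).all(|b| starts_with_bits_10(b) == signed_check(b))
}
```

Bytes 0x80..=0xBF map to -128..=-65 as i8, so they are the only values below -0x40; ASCII bytes are non-negative and lead bytes (0xC0 and above) map to -64..=-1.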

@jorgecarleitao
Owner

I opened a PR with a prop test for this to give us extra safety. Really cool optimization!
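
The prop test itself is not shown in this conversation. As a hedged sketch of the kind of property it could assert, the snippet below compares slice-by-slice validation with the whole-buffer check for every split point of a buffer. It enumerates split points exhaustively instead of using a property-testing crate, and all names are hypothetical.

```rust
/// Reference check: validate every slice on its own.
/// Assumes `offsets` is non-decreasing and within `values`.
fn slow_check(values: &[u8], offsets: &[usize]) -> bool {
    offsets
        .windows(2)
        .all(|w| std::str::from_utf8(&values[w[0]..w[1]]).is_ok())
}

/// Fast check: validate the buffer once, then reject any offset that
/// points at a continuation byte.
fn fast_check(values: &[u8], offsets: &[usize]) -> bool {
    std::str::from_utf8(values).is_ok()
        && offsets
            .iter()
            .all(|&o| values.get(o).map_or(true, |&b| (b as i8) >= -0x40))
}

/// Compares both checks for every split point `i`, using the offsets
/// `[0, i, len]` so that the slices always cover the whole buffer.
fn agree_on_all_splits(values: &[u8]) -> bool {
    (0..=values.len()).all(|i| {
        let offsets = [0, i, values.len()];
        slow_check(values, &offsets) == fast_check(values, &offsets)
    })
}
```

Calling agree_on_all_splits on the bytes of a string that mixes one- to four-byte characters exercises every way of cutting a character in half.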

@Dandandan Dandandan changed the title from "Faster utf8 validation: ~10% improvement" to "Faster utf8 validation: ~1.15-2x improvement" on Feb 7, 2022
@Dandandan Dandandan changed the title from "Faster utf8 validation: ~1.15-2x improvement" to "Faster utf8 validation: ~1.14-2x improvement" on Feb 7, 2022