Faster utf8 validation: ~1.14-2x improvement #823

Dandandan · 2022-02-07T18:22:02Z

Two improvements for try_check_offsets_and_utf8:

- Check bounds afterwards, and only for the last offset: the monotonicity check guarantees that it is the largest value (same as try_check_offsets).
- Run simdutf8::basic::from_utf8 on the entire buffer instead of on individual slices, then validate that each slice starts at the beginning of a UTF-8 sequence by checking whether its first byte is a valid byte to start a sequence. I am not 100% sure this is correct; it is inspired by a recent PR to arrow-rs.
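The two steps above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the function name check_offsets_and_utf8, the usize offsets, and the String error type are assumptions, and std::str::from_utf8 stands in for simdutf8::basic::from_utf8 so that the sketch needs no external crate.

```rust
/// Hypothetical sketch (not the arrow2 implementation): checks that `offsets`
/// are valid for `values` and that every slice between consecutive offsets
/// is valid UTF-8.
fn check_offsets_and_utf8(offsets: &[usize], values: &[u8]) -> Result<(), String> {
    // Offsets must never decrease.
    if offsets.windows(2).any(|w| w[0] > w[1]) {
        return Err("offsets must be monotonically increasing".to_string());
    }

    // Because the offsets are monotonic, the last one is the largest,
    // so one bounds check covers all of them.
    let last = match offsets.last() {
        Some(&last) => last,
        None => return Ok(()),
    };
    if last > values.len() {
        return Err("offsets must not exceed the values length".to_string());
    }

    // Validate the whole buffer once (std stand-in for simdutf8).
    std::str::from_utf8(values).map_err(|e| e.to_string())?;

    // Every offset that points into the buffer must point at the first byte
    // of a sequence, i.e. not at a continuation byte (0x80..=0xBF).
    for &offset in offsets {
        if offset < values.len() && (values[offset] as i8) < -0x40 {
            return Err(format!("offset {} splits a UTF-8 sequence", offset));
        }
    }
    Ok(())
}
```

With the bytes of "πa" ([207, 128, 97]), the offsets [0, 2, 3] pass, while [0, 1, 3] are rejected because offset 1 points at the continuation byte 128.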

The speedup is up to 2x on smaller inputs, which previously could not use the SIMD code because it depends on larger inputs.

read utf8 large 2^14    time:   [1.2200 ms 1.2232 ms 1.2262 ms]                                  
                        change: [-13.218% -12.371% -11.592%] (p = 0.00 < 0.05)
                        Performance has improved.

read utf8 emoji 2^14    time:   [130.95 us 131.35 us 131.82 us]                                 
                        change: [-49.840% -49.615% -49.358%] (p = 0.00 < 0.05)
                        Performance has improved.

codecov · 2022-02-07T18:30:17Z

Codecov Report

Merging #823 (00bba47) into main (1dd1b19) will increase coverage by 0.00%.
The diff coverage is 87.50%.

@@           Coverage Diff           @@
##             main     #823   +/-   ##
=======================================
  Coverage   71.63%   71.63%           
=======================================
  Files         326      326           
  Lines       17523    17525    +2     
=======================================
+ Hits        12552    12554    +2     
  Misses       4971     4971

Impacted Files	Coverage Δ
src/array/specification.rs	`91.89% <87.50%> (+3.32%)`	⬆️
src/compute/arithmetics/time.rs	`25.68% <0.00%> (-0.92%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ritchie46 · 2022-02-07T18:41:43Z

> run simdutf8::basic::from_utf8 on the entire buffer instead of individual strings and validate whether the start of the utf8 sequence is correct. I am not 100% sure it is correct, this is inspired by a recent PR to arrow-rs. It checks whether the first byte is a valid byte to start the sequence.

We tried this one before, but sadly we cannot do that optimization. There is no guarantee that a buffer that is valid as a whole contains only valid UTF-8 elements. For instance, the letter pi is [207, 128], which is not valid UTF-8 if we split it.

Or is this sound in combination with the is_char_boundary check? 🤔

Dandandan · 2022-02-07T18:52:55Z

Yes, I am aware of this. The letter pi split by other offsets will not validate, because 128 is not a valid start of a sequence: (128u8 as i8) < -0x40 evaluates to true, so the check returns an error.
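To make the pi case concrete, here is a small sketch of the first-byte check. The helper names is_sequence_start and offset_is_char_boundary are hypothetical and chosen only for this illustration. The boundary helper is only meaningful for a buffer that has already been validated as UTF-8 as a whole.

```rust
/// Returns true if `b` may start a UTF-8 sequence, i.e. it is not a
/// continuation byte. Continuation bytes (0x80..=0xBF) become -128..=-65
/// when reinterpreted as i8, which is below -0x40 (-64).
fn is_sequence_start(b: u8) -> bool {
    (b as i8) >= -0x40
}

/// Hypothetical helper: returns true if `offset` falls on a character
/// boundary of `bytes`, assuming `bytes` is valid UTF-8 as a whole.
fn offset_is_char_boundary(bytes: &[u8], offset: usize) -> bool {
    match bytes.get(offset) {
        // Inside the buffer: the byte must be able to start a sequence.
        Some(&b) => is_sequence_start(b),
        // Outside the buffer: only the end of the buffer is a boundary.
        None => offset == bytes.len(),
    }
}
```

For "π" the bytes are [207, 128]: offsets 0 and 2 are boundaries, while offset 1 points at 128 and is rejected, which agrees with str::is_char_boundary from the standard library.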

ritchie46 · 2022-02-07T18:58:51Z

Yes, I understand the rationale now. So this is sound under the assumption that all bytes of multi-byte Unicode chars have values > 127, right? Really interesting observation! 🙌

Dandandan · 2022-02-07T19:25:46Z

Yeah, every second, third, and fourth byte of a sequence always starts with the bits "10", which is what this condition checks for in a clever way.

Also see the encoding table here
https://en.m.wikipedia.org/wiki/UTF-8
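As a quick sanity check of that claim, the sketch below compares the signed-comparison trick with an explicit test of the top two bits over all 256 byte values. The function names are made up for this illustration.

```rust
/// Continuation-byte test written directly against the bit pattern 10xxxxxx.
fn is_continuation_by_mask(b: u8) -> bool {
    (b & 0b1100_0000) == 0b1000_0000
}

/// The same test via a signed comparison: bytes 0x80..=0xBF reinterpret as
/// -128..=-65, which are exactly the i8 values below -0x40 (-64).
fn is_continuation_by_cast(b: u8) -> bool {
    (b as i8) < -0x40
}

/// Returns true if both tests agree for every possible byte value.
fn checks_agree_for_all_bytes() -> bool {
    (0..=u8::MAX).all(|b| is_continuation_by_mask(b) == is_continuation_by_cast(b))
}
```

The edge cases are 0xBF (10111111, -65 as i8, a continuation byte) and 0xC0 (11000000, -64 as i8, not a continuation byte).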

jorgecarleitao · 2022-02-07T20:22:45Z

I PRed a prop test for this to give us extra safety. Really cool optimization!
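The proptest itself is not shown in this thread. As a rough, std-only illustration of the property such a test could exercise, the sketch below compares the whole-buffer approach against validating each slice separately, exhaustively over short buffers. All names (fast_check, slow_check, count_disagreements) and the choice of alphabet are assumptions made for this example, not the test that was added.

```rust
/// Whole-buffer approach: validate `values` once, then require that `split`
/// does not point at a continuation byte.
fn fast_check(values: &[u8], split: usize) -> bool {
    let at_boundary = match values.get(split) {
        Some(&b) => (b as i8) >= -0x40,
        None => split == values.len(),
    };
    std::str::from_utf8(values).is_ok() && at_boundary
}

/// Reference approach: validate the two slices separately.
/// `split` must be <= `values.len()`.
fn slow_check(values: &[u8], split: usize) -> bool {
    let (left, right) = values.split_at(split);
    std::str::from_utf8(left).is_ok() && std::str::from_utf8(right).is_ok()
}

/// Enumerates every buffer of up to `max_len` bytes drawn from `alphabet`,
/// tries every split point, and counts how often the two checks disagree.
fn count_disagreements(alphabet: &[u8], max_len: usize) -> usize {
    let mut disagreements = 0;
    let mut buffers: Vec<Vec<u8>> = vec![Vec::new()];
    for _ in 0..=max_len {
        let mut next = Vec::new();
        for buf in &buffers {
            for split in 0..=buf.len() {
                if fast_check(buf, split) != slow_check(buf, split) {
                    disagreements += 1;
                }
            }
            // Extend each buffer by one byte for the next round.
            for &b in alphabet {
                let mut longer = buf.clone();
                longer.push(b);
                next.push(longer);
            }
        }
        buffers = next;
    }
    disagreements
}
```

An alphabet such as [b'a', 0xCF, 0x80, 0xF0, 0x9F, 0x98] covers ASCII, the two bytes of pi, and most of the bytes of a four-byte emoji, so it produces both valid and invalid buffers; the two checks are expected to agree on all of them.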

Added proptest on utf8 validation

Faster utf8 validation

5728dd4

jorgecarleitao and others added 2 commits February 7, 2022 19:47

Added proptest on utf8 val

3e2e914

Add emoji test

c0006c7

Add emoji test

a88dbd5

Dandandan changed the title ~~Faster utf8 validation: ~10% improvement~~ Faster utf8 validation: ~1.15-2x improvement Feb 7, 2022

Dandandan changed the title ~~Faster utf8 validation: ~1.15-2x improvement~~ Faster utf8 validation: ~1.14-2x improvement Feb 7, 2022

Dandandan added 3 commits February 7, 2022 21:38

Fix test

a976179

Merge pull request #1 from jorgecarleitao/add_check

7c37fbc

Added proptest on utf8 validation

Fix tests

00bba47

jorgecarleitao merged commit 7e25ef2 into jorgecarleitao:main Feb 8, 2022

ritchie46 mentioned this pull request Feb 8, 2022

Improve performance of utf8 validation in csv reader pola-rs/polars#2577

Merged
