Faster utf8 validation: ~1.14-2x improvement #823
Codecov Report
@@ Coverage Diff @@
## main #823 +/- ##
=======================================
Coverage 71.63% 71.63%
=======================================
Files 326 326
Lines 17523 17525 +2
=======================================
+ Hits 12552 12554 +2
Misses 4971 4971
|
We tried this one before, but sadly we cannot do that optimization. There is no guarantee that a correct entire buffer contains correct utf8 elements. For instance the letter Or is this sound in combination with the |
Yes, I am aware of this. The letter pi with other offsets will not validate, as its second byte, 128, is not a valid start of a sequence |
Yes, I understand the rationale now. So this is sound under the assumption that all multi-byte Unicode chars contain values > 127, right? Really interesting observation! 🙌 |
Yeah, every second/third/fourth byte always starts with bits "10" which is what this condition checks for in a clever way. Also see the encoding table here |
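To make the "10" prefix argument concrete, here is a minimal sketch of such a check. The function name `is_sequence_start` is hypothetical, and the exact condition used in the PR is not shown in this thread, so treat the signed-byte comparison below as one possible formulation rather than the PR's code.

```rust
/// Hypothetical helper: returns true if `byte` may start a UTF-8 sequence,
/// i.e. it is NOT a continuation byte of the form 0b10xxxxxx.
fn is_sequence_start(byte: u8) -> bool {
    // Continuation bytes occupy 0x80..=0xBF. Reinterpreted as i8 that range
    // is -128..=-65, so a single signed comparison separates them from
    // ASCII bytes (0..=127) and lead bytes (0xC0 and above, i.e. -64..=-1).
    (byte as i8) >= -64
}

/// Classifies every byte of `s` as a possible sequence start or not.
fn sequence_starts(s: &str) -> Vec<bool> {
    s.bytes().map(is_sequence_start).collect()
}
```

The letter pi is encoded as the two bytes 207 and 128, so an offset pointing at the second byte lands on a continuation byte and is rejected, while an ASCII byte or a lead byte is accepted.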
I PRed a prop test for this to give us extra safety. Really cool optimization! |
Added proptest on utf8 validation
Two improvements for
try_check_offsets_and_utf8
try_check_offsets
).simdutf8::basic::from_utf8
is run on the entire buffer instead of individual slices, and each slice is then validated to start at the beginning of a utf8 sequence. I am not 100% sure it is correct; this is inspired by a recent PR to arrow-rs. It checks whether the first byte of each slice is a valid byte to start a sequence.

The change is up to 2x faster on smaller inputs (which previously failed to use the simd code, as that depends on larger inputs).
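The approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `check_offsets_and_utf8` is a hypothetical name, offsets are plain `usize` values, errors are plain strings, and `std::str::from_utf8` stands in for `simdutf8::basic::from_utf8` so the example needs no external crate.

```rust
/// Hypothetical sketch: validates the whole `values` buffer once, then checks
/// that no offset points into the middle of a multi-byte character.
fn check_offsets_and_utf8(offsets: &[usize], values: &[u8]) -> Result<(), String> {
    // Offsets must be non-decreasing and must not run past the buffer.
    if offsets.windows(2).any(|w| w[0] > w[1]) {
        return Err("offsets are not monotonically increasing".to_string());
    }
    if offsets.last().map_or(false, |&last| last > values.len()) {
        return Err("offsets exceed the length of the values buffer".to_string());
    }

    // One validation pass over the entire buffer
    // (std stand-in for simdutf8::basic::from_utf8).
    std::str::from_utf8(values).map_err(|e| e.to_string())?;

    // Every offset inside the buffer must land on the first byte of a
    // sequence. Continuation bytes (0b10xxxxxx) are below -64 as i8.
    for &offset in offsets {
        if offset < values.len() && (values[offset] as i8) < -64 {
            return Err(format!("offset {} splits a utf8 character", offset));
        }
    }
    Ok(())
}
```

For the buffer `"aπb"` (bytes 97, 207, 128, 98), offsets `[0, 1, 3, 4]` pass, while `[0, 2, 4]` fail because offset 2 points at the continuation byte 128. Since the buffer is valid as a whole and every slice boundary falls on a character boundary, each individual slice is valid utf8 as well.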