
std::str: optimize is_utf8 & first_non_utf8_index. #12241

Closed (wants to merge 1 commit)

Conversation

@huonw (Member) commented Feb 13, 2014

This makes is_utf8 up to 40% faster.

Before:

    is_utf8_100_ascii               ... bench:       123 ns/iter (+/- 9)
    is_utf8_100_multibyte           ... bench:       134 ns/iter (+/- 7)

    from_utf8_lossy_100_ascii       ... bench:       118 ns/iter (+/- 8)
    from_utf8_lossy_100_invalid     ... bench:       378 ns/iter (+/- 12)
    from_utf8_lossy_100_multibyte   ... bench:       135 ns/iter (+/- 5)
    from_utf8_lossy_invalid         ... bench:       129 ns/iter (+/- 10)

After:

    is_utf8_100_ascii               ... bench:        77 ns/iter (+/- 4)
    is_utf8_100_multibyte           ... bench:        96 ns/iter (+/- 9)

    from_utf8_lossy_100_ascii       ... bench:        91 ns/iter (+/- 14)
    from_utf8_lossy_100_invalid     ... bench:       414 ns/iter (+/- 30)
    from_utf8_lossy_100_multibyte   ... bench:       119 ns/iter (+/- 10)
    from_utf8_lossy_invalid         ... bench:       108 ns/iter (+/- 11)

@huonw (Member, Author) commented Feb 13, 2014

I don't have a good explanation for why 100_invalid would be slower, especially since first_non_utf8_index is called only once in it, at the very start.

cc @kballard

@alexcrichton (Member):

I'm a little sad about all the new unsafe code in this commit; is there no way to limit the scope of the unsafe code, or to remove it entirely?

@huonw (Member, Author) commented Feb 13, 2014

Well, it's having to manipulate pointers directly, so I find it unlikely that it can be anywhere near as efficient done any other way (e.g. manually stepping an iterator puts a bounds check on every access, like the old code).
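For readers unfamiliar with the trade-off being discussed, here is a minimal sketch of a pointer-based scan in modern Rust syntax. It is not the PR's actual code, and the helper name `all_ascii_ptr` is invented for illustration: the point is that walking a raw pointer from start to end performs one comparison per byte, with no per-access bounds check.

```rust
/// Illustrative sketch (NOT the PR's code): check whether a byte slice
/// is pure ASCII by walking raw pointers. The only per-iteration test
/// is `p < end`; there is no separate bounds check on each access.
fn all_ascii_ptr(v: &[u8]) -> bool {
    unsafe {
        let mut p = v.as_ptr();
        let end = p.add(v.len()); // one past the last byte
        while p < end {
            if *p >= 0x80 {
                return false; // non-ASCII lead byte found
            }
            p = p.add(1);
        }
    }
    true
}
```

The `unsafe` block is sound because `p` only ever ranges over `[as_ptr(), as_ptr() + len]` and is dereferenced strictly before `end`; this is the kind of "proof of safety as comments" discussed later in the thread.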

@alexcrichton (Member):

I find it somewhat difficult to believe that it is impossible for safe code to reach this same level of performance.

I'm not saying that this change should have all of it removed, but we're getting into the habit of reaching for unsafe so frequently that it's diminishing its usefulness. I feel like a good rule of thumb is that every unsafe block should document why it needs to be unsafe and why it can't be done with safe code (unless that's already apparent). Could you add some comments explaining why unsafe is used, and why it's used over such a large scope?

@lilyball (Contributor):

@huonw from_utf8_lossy_100_invalid is likely doing a bunch of reallocations. Each invalid byte is replaced by 3 bytes in the output. It's also writing 3 bytes at a time instead of writing an entire contiguous good chunk, as happens with valid UTF-8.
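The 3-bytes-per-invalid-byte figure comes from the replacement character U+FFFD, whose UTF-8 encoding is 3 bytes long. A quick sketch using today's std API (the helper name `lossy_len` is invented here; the 2014 API differed) shows how an invalid-heavy input triples in size, which is what drives the reallocation cost described above:

```rust
/// Hypothetical helper: byte length of the lossily-decoded output.
/// Each invalid input byte becomes "\u{FFFD}" (3 bytes) in the output,
/// so a run of invalid bytes triples in size.
fn lossy_len(input: &[u8]) -> usize {
    String::from_utf8_lossy(input).len()
}
```

For example, 4 invalid `0xFF` bytes decode to 4 replacement characters, i.e. 12 output bytes, while valid input keeps its original length.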

@lilyball (Contributor):

I'm willing to r=me this, but I will defer to @alexcrichton's caution.

@alexcrichton (Member):

Can this be implemented with a byte iterator? That would eliminate all bounds checks, and you'd only have the unwrap-related checks when dealing with wider characters. In theory we could extend the byte iterator with a take2 method if we wanted to go even further in reducing checks, but if bounds checks were the major issue, it seems that a byte iterator should fix that?
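A rough sketch of what the iterator shape being proposed might look like, in modern Rust. This is an invented simplification, not the PR's benchmarked code: it classifies sequence width by the lead byte and consumes continuation bytes through the iterator (whose `Option` return replaces explicit bounds checks), but it does not reject every overlong or surrogate encoding that a real validator must.

```rust
/// Simplified iterator-based validity check (illustration only).
/// Lead-byte ranges follow the usual UTF-8 layout; continuation
/// bytes must match 0b10xxxxxx.
fn is_utf8_iter(v: &[u8]) -> bool {
    let mut iter = v.iter();
    while let Some(&first) = iter.next() {
        if first < 0x80 {
            continue; // ASCII fast path
        }
        // Number of continuation bytes implied by the lead byte.
        let extra = match first {
            0xC2..=0xDF => 1,
            0xE0..=0xEF => 2,
            0xF0..=0xF4 => 3,
            _ => return false, // invalid lead byte
        };
        for _ in 0..extra {
            match iter.next() {
                Some(&b) if b & 0xC0 == 0x80 => {} // valid continuation
                _ => return false, // truncated or malformed sequence
            }
        }
    }
    true
}
```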

@huonw (Member, Author) commented Feb 13, 2014

@kballard both the new and old versions are doing the same reallocations, so that doesn't explain why the timing differs between them.

@alexcrichton the byte iterator does a bounds check on every access (to see whether it's at the end or not), and it's not being used in a tight loop, so I find it unlikely that LLVM will be able to streamline it well; I'll have a go, though.

@lilyball (Contributor):

@huonw Oh I see what you're saying. Perhaps first_non_utf8_index has more overhead in the case where it fails at index 0 than the old implementation did? I'm curious what the benchmark change is for a test where the first byte is invalid but the rest of the string is valid.

@ghost commented Feb 13, 2014

@alexcrichton: Generally, iterator bounds checks are not optimized away at any optimization level, including LTO. It seems to me that LLVM does not have enough information.
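To make the two patterns under discussion concrete, here is a hypothetical side-by-side (invented names, not from the PR): an index-based loop, where every `v[i]` carries a bounds check the optimizer may or may not eliminate, versus an iterator, where the end test is folded into the `Option` returned by `next`.

```rust
/// Index-based loop: each `v[i]` performs a bounds check that the
/// optimizer can sometimes, but not always, prove away.
fn sum_indexed(v: &[u8]) -> u64 {
    let mut total = 0u64;
    for i in 0..v.len() {
        total += v[i] as u64;
    }
    total
}

/// Iterator-based loop: the slice iterator's own end-of-slice test
/// doubles as the loop condition, so there is no separate check.
fn sum_iter(v: &[u8]) -> u64 {
    v.iter().map(|&b| b as u64).sum()
}
```

Both compute the same result; the thread's disagreement is purely about which form the 2014 optimizer could compile to the tighter loop.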

@huonw (Member, Author) commented Feb 15, 2014

I investigated using iterators, results:

    test is_utf8_iter::is_utf8_100_ascii            ... bench:       144 ns/iter (+/- 10)
    test is_utf8_iter::is_utf8_100_multibyte        ... bench:       120 ns/iter (+/- 12)
    test is_utf8_iter::random                       ... bench:       494 ns/iter (+/- 33)
    test is_utf8_iter::random_invalid               ... bench:       276 ns/iter (+/- 443)

    test is_utf8_iter_inline::is_utf8_100_ascii     ... bench:       116 ns/iter (+/- 1)
    test is_utf8_iter_inline::is_utf8_100_multibyte ... bench:       110 ns/iter (+/- 3)
    test is_utf8_iter_inline::random                ... bench:       493 ns/iter (+/- 34)
    test is_utf8_iter_inline::random_invalid        ... bench:       249 ns/iter (+/- 400)

    test is_utf8_ptr::is_utf8_100_ascii             ... bench:        88 ns/iter (+/- 5)
    test is_utf8_ptr::is_utf8_100_multibyte         ... bench:        97 ns/iter (+/- 4)
    test is_utf8_ptr::random                        ... bench:       369 ns/iter (+/- 24)
    test is_utf8_ptr::random_invalid                ... bench:       239 ns/iter (+/- 342)

    test is_utf8_std::is_utf8_100_ascii             ... bench:       122 ns/iter (+/- 4)
    test is_utf8_std::is_utf8_100_multibyte         ... bench:       134 ns/iter (+/- 9)
    test is_utf8_std::random                        ... bench:       512 ns/iter (+/- 43)
    test is_utf8_std::random_invalid                ... bench:       264 ns/iter (+/- 430)

std is the current implementation, ptr is this PR, iter and iter_inline are two iterator variants; source.

So the iterator versions are slightly faster than the current implementation (and with no unsafe, unlike either the one in std or this PR's pointer one), but still significantly slower than the pointer version here. I think the pure-ASCII case is hampered by #11751.

@lilyball (Contributor):

Given how core a piece of functionality is_utf8 is, I'm in favor of going with the unsafe (but fastest) ptr approach.

@huonw (Member, Author) commented Feb 15, 2014

If we do decide to go with this pointer approach, I'll write a proof of safety as comments in the source, as a substitute for compiler verification.

@alexcrichton (Member):

Thank you for the thorough investigation, it's quite useful!

I'm not quite ready to give up this battle though, I was curious why the pure-ascii case with iterators was so much slower than with pointers, so I did some investigation of my own. With this code I got these timings:

    $ rustc --opt-level=3 gistfile1.rs --test -Zlto && ./gistfile1 --bench is_utf8_100_ascii

    running 2 tests
    test is_utf8_iter_inline::is_utf8_100_ascii ... bench:        95 ns/iter (+/- 2)
    test is_utf8_ptr::is_utf8_100_ascii         ... bench:        63 ns/iter (+/- 1)

    test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured

These results lead me to believe that the vector iterator isn't quite as good as it could be, and that this is what's pushing us toward unsafe code in this implementation. With this in mind, I continue to be hesitant to merge this: I believe the speedup from using unsafe code is at least partly a result of this vector iterator performance issue.

@alexcrichton (Member):

@huonw pointed me at #11751 on IRC, which is in turn caused by #9546, which is the cause of the slowdown I was seeing.

I am tempted to postpone the unsafe change until #9546 is fixed; I would be more than happy to accept the iterator changes, however.

@huonw (Member, Author) commented Feb 16, 2014

Iterators in #12314.

@huonw closed this Feb 16, 2014