String::to_lowercase does not get vectorized well contrary to code comments #123712
Comments
The way the loop ORs the bits for the is-ascii check into a single usize doesn't seem ideal for vectorization. Keeping the lanes independent should yield better autovectorization results.
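To make the difference between the two styles concrete, here is a minimal sketch, not the standard library code. The function names and the fixed 16-byte chunk are made up for illustration, and the first variant folds into a `u8` instead of reading whole `usize` words. Whether the second form actually compiles to better code depends on the compiler version and target, so check the generated assembly.

```rust
/// Folds every byte into one accumulator and tests the high bit once at the end.
/// Each iteration depends on the result of the previous one.
pub fn is_ascii_or_fold(chunk: &[u8; 16]) -> bool {
    let mut acc = 0u8;
    for &b in chunk {
        acc |= b;
    }
    acc & 0x80 == 0
}

/// Computes one flag per byte first, then reduces the flags.
/// The per-byte comparisons are independent of each other.
pub fn is_ascii_per_lane(chunk: &[u8; 16]) -> bool {
    let mut flags = [false; 16];
    for (flag, &b) in flags.iter_mut().zip(chunk) {
        *flag = b <= 127;
    }
    flags.iter().all(|&f| f)
}
```

Both functions return the same answer for every input; only the shape of the reduction differs.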
@the8472 Thanks for the suggestion. I already started with a simd implementation, but now checked again if the autovectorization could be improved. With the following main loop

while slice.len() >= N {
let chunk = &slice[..N];
let mut is_ascii = true;
for j in 0..N {
is_ascii &= chunk[j] <= 127;
}
if !is_ascii {
break;
}
for j in 0..N {
out_slice[j] = MaybeUninit::new(convert(&chunk[j]));
}
i += N;
slice = &slice[N..];
out_slice = &mut out_slice[N..];
}

The assembly and performance are indeed better, but there is still some weird shuffling and shifting going on:

0,08 │ 80:┌─→movdqu xmm3,XMMWORD PTR [rbx+rax*1] ▒
0,07 │ │ pshufd xmm4,xmm3,0xee ▒
0,37 │ │ por xmm4,xmm3 ▒
6,22 │ │ pshufd xmm5,xmm4,0x55 ▒
0,16 │ │ por xmm5,xmm4 ▒
│ │ movdqa xmm4,xmm5 ▒
2,02 │ │ psrld xmm4,0x10 ▒
4,48 │ │ por xmm4,xmm5 ▒
0,07 │ │ movdqa xmm5,xmm4 ▒
0,08 │ │ psrlw xmm5,0x8 ▒
0,60 │ │ por xmm5,xmm4 ▒
6,58 │ │ movd ecx,xmm5 ▒
│ │ test cl,cl ▒
│ │↓ js ef ▒
│ │ movdqa xmm4,xmm3 ▒
1,80 │ │ paddb xmm4,xmm0 ▒
4,70 │ │ movdqa xmm5,xmm4 ▒
│ │ pminub xmm5,xmm1 ▒
│ │ pcmpeqb xmm5,xmm4 ▒
0,37 │ │ pand xmm5,xmm2 ▒
8,50 │ │ por xmm5,xmm3 ▒
│ │ movdqu XMMWORD PTR [r13+rax*1+0x0],xmm5 ▒
│ │ add rax,0x10 ▒
1,94 │ │ add r12,0xfffffffffffffff0 ▒
1,05 │ ├──cmp r12,0xf ▒
4,19 │ └──ja 80

The explicit simd version looks a bit better, mostly because the ascii check directly translates to a single pmovmskb:

const LANES: usize = 16;
let simd_range_start = Simd::splat(range_start);
let simd_range_end = Simd::splat(range_end);
let simd_xor_value = Simd::splat(xor_value);
while slice.len() >= LANES {
let chunk = Simd::<u8, LANES>::from_slice(slice);
let is_ascii = chunk.cast::<i8>().simd_ge(Simd::splat(0));
if is_ascii.all() {
let is_in_range = chunk.simd_ge(simd_range_start) & chunk.simd_le(simd_range_end);
let converted = is_in_range.select(chunk ^ simd_xor_value, chunk);
// SAFETY: output has enough capacity and we never read the uninitialized slice
unsafe {
let out_slice = core::slice::from_raw_parts_mut(out_ptr, LANES);
converted.copy_to_slice(out_slice);
out_ptr = out_ptr.add(LANES);
}
i += LANES;
slice = &slice[LANES..];
} else {
break;
}
}

│ 80:┌─→movdqu xmm3,XMMWORD PTR [r14+rcx*1] ▒
0,12 │ │ pmovmskb edx,xmm3 ▒
0,12 │ │ test edx,edx ▒
│ │↓ jne d4 ▒
5,82 │ │ movdqa xmm4,xmm3 ▒
0,04 │ │ paddb xmm4,xmm0 ▒
│ │ movdqa xmm5,xmm4 ▒
0,08 │ │ pminub xmm5,xmm1 ▒
10,05 │ │ pcmpeqb xmm5,xmm4 ▒
│ │ movdqa xmm4,xmm5 ▒
│ │ pandn xmm4,xmm3 ▒
│ │ pxor xmm3,xmm2 ▒
7,02 │ │ pand xmm3,xmm5 ▒
0,04 │ │ por xmm3,xmm4 ▒
│ │ movdqu XMMWORD PTR [rax+rcx*1],xmm3 ▒
│ │ add rcx,0x10 ▒
9,69 │ │ mov rdx,rsi ▒
│ │ add rdx,0xfffffffffffffff0 ▒
│ │ mov rsi,rdx ▒
│ ├──cmp rdx,0xf ▒
5,03 │ └──ja 80

Do you have a preference here between autovectorization and explicit simd?
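For readers following along without nightly `portable_simd`: the range check plus `select(chunk ^ xor_value, chunk)` in the simd snippet can be written per byte in stable Rust. The sketch below is an assumption about what such a scalar `convert` could look like, not the code from the standard library. The names `flip_case_in_range`, `to_lower`, `to_upper` and `lower_chunk` are hypothetical.

```rust
/// Flips the bits in `xor_value` if `b` lies in `range_start..=range_end`,
/// otherwise returns `b` unchanged. Written without a branch so that a loop
/// over many bytes stays easy to vectorize.
pub fn flip_case_in_range(b: u8, range_start: u8, range_end: u8, xor_value: u8) -> u8 {
    let in_range = (range_start <= b) & (b <= range_end);
    // `in_range as u8` is 0 or 1, so the mask is either 0 or `xor_value`.
    b ^ (xor_value * in_range as u8)
}

/// ASCII lowercase: 'A'..='Z' differ from 'a'..='z' only in bit 0x20.
pub fn to_lower(b: u8) -> u8 {
    flip_case_in_range(b, b'A', b'Z', 0x20)
}

/// ASCII uppercase, using the same bit with the lowercase range.
pub fn to_upper(b: u8) -> u8 {
    flip_case_in_range(b, b'a', b'z', 0x20)
}

/// Applies `to_lower` to every byte of a 16-byte chunk.
pub fn lower_chunk(chunk: [u8; 16]) -> [u8; 16] {
    chunk.map(to_lower)
}
```

For ASCII input, `to_lower` and `to_upper` agree with `u8::to_ascii_lowercase` and `u8::to_ascii_uppercase`.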
If the SIMD impl results in reasonable code across architectures, including some without SIMD then that should be fine. If it's only meant to target x86 with SSE2 then that'd mean having multiple paths and in that case we'd still have to tweak the generic path anyway. At that point we might as well just optimize the latter further.
You can probably get rid of those shuffles and shifts too by keeping multiple bools in a small array. That way the optimizer will more easily see that it can shove them into independent simd lanes. Increasing the unroll count might help too, to fit better into simd registers.

We do get pretty decent autovectorization in other places in the standard library by massaging things into a state that basically looks like arrays-instead-of-SIMD-types. E.g. rust/library/core/src/str/iter.rs, lines 61 to 73 in 40cf1f9.
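As an illustration of the arrays-instead-of-SIMD-types style, here is a minimal sketch that counts chars by filling a small bool array per chunk. It is not a copy of the referenced standard-library lines. The function names and the chunk size of 32 are assumptions chosen for the example.

```rust
const CHUNK: usize = 32;

/// Returns true if `b` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
fn is_continuation(b: u8) -> bool {
    (b as i8) < -64
}

/// Counts the chars in `s` by counting the bytes that start a char.
pub fn count_chars(s: &str) -> usize {
    let mut chunks = s.as_bytes().chunks_exact(CHUNK);
    let mut total = 0usize;
    for chunk in &mut chunks {
        // One flag per byte; nothing is carried from one lane to the next.
        let mut starts = [false; CHUNK];
        for i in 0..CHUNK {
            starts[i] = !is_continuation(chunk[i]);
        }
        // At most 32 flags are set, so the u8 sum cannot overflow.
        total += starts.iter().map(|&f| f as u8).sum::<u8>() as usize;
    }
    // Bytes that do not fill a whole chunk are handled one at a time.
    for &b in chunks.remainder() {
        total += !is_continuation(b) as usize;
    }
    total
}
```

The result should always match `s.chars().count()`; the array only changes how the work is presented to the optimizer.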
It seems LLVM is too "smart" for this trick, on its own that generates the exact same code. The

let mut is_ascii = [false; N];
for j in 0..N {
is_ascii[j] = chunk[j] <= 127;
}
if is_ascii.into_iter().map(|x| x as u8).sum::<u8>() as usize != N {
break;
}
It only needs to be better than the autovectorized version on those platforms ;)
My initial idea was to gate the simd code on either SSE2 or NEON (possibly also AltiVec and RISC-V). I'd also add a scalar loop for the non-multiple-of-N remainder, so all-ascii strings are fully handled by specialized code. Currently this remainder goes through the generic conversion logic.

But I agree that if similar code quality can be achieved with autovectorization, that would be preferable. I'll open a PR after polishing the code a little more.
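A rough sketch of the chunked loop combined with a scalar loop for the remainder, written in safe Rust. This is only an illustration of the idea, not the code that ended up in the PR. The name `lower_ascii_prefix`, its signature and the choice of N = 16 are assumptions.

```rust
const N: usize = 16;

/// Lowercases the longest all-ASCII prefix of `s` and returns it together
/// with the unconverted rest, which starts at the first non-ASCII char.
pub fn lower_ascii_prefix(s: &str) -> (String, &str) {
    let bytes = s.as_bytes();
    let mut out = String::with_capacity(s.len());
    let mut i = 0;

    // Chunked part: whole chunks of N bytes with a per-lane ASCII check.
    while bytes.len() - i >= N {
        let chunk = &bytes[i..i + N];
        let mut is_ascii = [false; N];
        for j in 0..N {
            is_ascii[j] = chunk[j] <= 127;
        }
        if is_ascii.iter().map(|&x| x as u8).sum::<u8>() as usize != N {
            break;
        }
        out.extend(chunk.iter().map(|b| b.to_ascii_lowercase() as char));
        i += N;
    }

    // Scalar part: the tail shorter than N, plus the leading ASCII bytes of
    // a chunk that failed the check above.
    while i < bytes.len() && bytes[i] <= 127 {
        out.push(bytes[i].to_ascii_lowercase() as char);
        i += 1;
    }

    // `i` is either the end of the string or the first byte of a non-ASCII
    // char, so slicing here is always on a char boundary.
    (out, &s[i..])
}
```

With this shape an all-ascii string never reaches the generic conversion, and for mixed input the caller only has to run the generic logic on the returned rest.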
Refactor the code in the `convert_while_ascii` helper function to make it more suitable for auto-vectorization and also process the full ascii prefix of the string. The generic case conversion logic will only be invoked starting from the first non-ascii character. The runtime on a microbenchmark with a small ascii-only input decreases from ~55ns to ~18ns per iteration. The new implementation also reduces the amount of unsafe code and encapsulates all unsafe inside the helper function. Fixes rust-lang#123712
Refactor the code in the `convert_while_ascii` helper function to make it more suitable for auto-vectorization and also process the full ascii prefix of the string. The generic case conversion logic will only be invoked starting from the first non-ascii character. The runtime on microbenchmarks with ascii-only inputs improves between 2x for short and 7x for long inputs on x86_64 and aarch64. The new implementation also encapsulates all unsafe inside the `convert_while_ascii` function. Fixes rust-lang#123712
Refactor the code in the `convert_while_ascii` helper function to make it more suitable for auto-vectorization and also process the full ascii prefix of the string. The generic case conversion logic will only be invoked starting from the first non-ascii character. The runtime on microbenchmarks with ascii-only inputs improves between 1.5x for short and 4x for long inputs on x86_64 and aarch64. The new implementation also encapsulates all unsafe inside the `convert_while_ascii` function. Fixes rust-lang#123712
…-vectorization, r=the8472 Improve autovectorization of to_lowercase / to_uppercase functions Refactor the code in the `convert_while_ascii` helper function to make it more suitable for auto-vectorization and also process the full ascii prefix of the string. The generic case conversion logic will only be invoked starting from the first non-ascii character. The runtime on a microbenchmark with a small ascii-only input decreases from ~55ns to ~18ns per iteration. The new implementation also reduces the amount of unsafe code and encapsulates all unsafe inside the helper function. Fixes rust-lang#123712
I'm looking into the performance of `to_lowercase` / `to_uppercase` on mostly ascii strings, using a small microbenchmark added to `library/alloc/benches/string.rs`.

Using linux perf tooling I see that the hot part of the code is the following large loop, which despite heavy use of sse2 instructions only seems to process 32 bytes per iteration.

I don't see an easy way to improve the autovectorization of this code, but it should be relatively easy to explicitly vectorize it using `portable_simd`, and I would like to prepare such a PR if there are no objections. As far as I know, `portable_simd` is already in use inside `core`, for example by #103779.