Rollup merge of rust-lang#59283 - SimonSapin:branchless-ascii-case, r=joshtriplett
Make ASCII case conversions more than 4× faster

Reformatted output of `./x.py bench src/libcore --test-args ascii` below. The `libcore` benchmark calls `[u8]::make_ascii_lowercase`. `lookup` has code (effectively) identical to the code before this PR, and ~~`branchless`~~ `mask_shifted_bool_match_range` to the code after this PR.

~~See [code comments](rust-lang@ce933f7#diff-01076f91a26400b2db49663d787c2576R3796) in `u8::to_ascii_uppercase` in `src/libcore/num/mod.rs` for an explanation of the branchless algorithm.~~ **Update:** the algorithm was simplified while keeping the same performance; compare the `branchless` and `mask_shifted_bool_match_range` benchmarks. (A sketch of the single-byte branchless idea follows the benchmark results below.)

Credits to @raphlinus for the idea in https://twitter.com/raphlinus/status/1107654782544736261, which extends this algorithm to “fake SIMD” on `u32` to convert four bytes at a time. The `fake_simd_u32` benchmark implements this with [`let (before, aligned, after) = bytes.align_to_mut::<u32>()`](https://doc.rust-lang.org/std/primitive.slice.html#method.align_to_mut). Note however that this is buggy when an addition carries/overflows into the next byte (which does not happen if the input is known to be ASCII); a second sketch after the benchmark results illustrates the approach.

This could be fixed (to optimize `[u8]::make_ascii_lowercase` and `[u8]::make_ascii_uppercase` in `src/libcore/slice/mod.rs`) either with some more bitwise trickery that I didn’t quite figure out, or by using “real” SIMD intrinsics for byte-wise addition. I did not pursue this, however, because the current (incorrect) fake SIMD algorithm is only marginally faster than the one-byte-at-a-time branchless algorithm. This is because LLVM auto-vectorizes the latter, as can be seen on https://rust.godbolt.org/z/anKtbR.

Benchmark results on Linux x64 with an Intel i7-7700K (updated from rust-lang#59283 (comment)):

```
6830 bytes string:

alloc_only                           ... bench:   112 ns/iter (+/- 0) = 62410 MB/s
black_box_read_each_byte             ... bench: 1,733 ns/iter (+/- 8) = 4033 MB/s
lookup_table                         ... bench: 1,766 ns/iter (+/- 11) = 3958 MB/s
branch_and_subtract                  ... bench:   417 ns/iter (+/- 1) = 16762 MB/s
branch_and_mask                      ... bench:   401 ns/iter (+/- 1) = 17431 MB/s
branchless                           ... bench:   365 ns/iter (+/- 0) = 19150 MB/s
libcore                              ... bench:   367 ns/iter (+/- 1) = 19046 MB/s
fake_simd_u32                        ... bench:   361 ns/iter (+/- 2) = 19362 MB/s
fake_simd_u64                        ... bench:   361 ns/iter (+/- 1) = 19362 MB/s
mask_mult_bool_branchy_lookup_table  ... bench: 6,309 ns/iter (+/- 19) = 1107 MB/s
mask_mult_bool_lookup_table          ... bench: 4,183 ns/iter (+/- 29) = 1671 MB/s
mask_mult_bool_match_range           ... bench:   339 ns/iter (+/- 0) = 20619 MB/s
mask_shifted_bool_match_range        ... bench:   339 ns/iter (+/- 1) = 20619 MB/s

32 bytes string:

alloc_only                           ... bench:    15 ns/iter (+/- 0) = 2133 MB/s
black_box_read_each_byte             ... bench:    29 ns/iter (+/- 0) = 1103 MB/s
lookup_table                         ... bench:    24 ns/iter (+/- 4) = 1333 MB/s
branch_and_subtract                  ... bench:    16 ns/iter (+/- 0) = 2000 MB/s
branch_and_mask                      ... bench:    16 ns/iter (+/- 0) = 2000 MB/s
branchless                           ... bench:    16 ns/iter (+/- 0) = 2000 MB/s
libcore                              ... bench:    15 ns/iter (+/- 0) = 2133 MB/s
fake_simd_u32                        ... bench:    17 ns/iter (+/- 0) = 1882 MB/s
fake_simd_u64                        ... bench:    16 ns/iter (+/- 0) = 2000 MB/s
mask_mult_bool_branchy_lookup_table  ... bench:    42 ns/iter (+/- 0) = 761 MB/s
mask_mult_bool_lookup_table          ... bench:    35 ns/iter (+/- 0) = 914 MB/s
mask_mult_bool_match_range           ... bench:    16 ns/iter (+/- 0) = 2000 MB/s
mask_shifted_bool_match_range        ... bench:    16 ns/iter (+/- 0) = 2000 MB/s

7 bytes string:

alloc_only                           ... bench:    14 ns/iter (+/- 0) = 500 MB/s
black_box_read_each_byte             ... bench:    22 ns/iter (+/- 0) = 318 MB/s
lookup_table                         ... bench:    16 ns/iter (+/- 0) = 437 MB/s
branch_and_subtract                  ... bench:    16 ns/iter (+/- 0) = 437 MB/s
branch_and_mask                      ... bench:    16 ns/iter (+/- 0) = 437 MB/s
branchless                           ... bench:    19 ns/iter (+/- 0) = 368 MB/s
libcore                              ... bench:    20 ns/iter (+/- 0) = 350 MB/s
fake_simd_u32                        ... bench:    18 ns/iter (+/- 0) = 388 MB/s
fake_simd_u64                        ... bench:    21 ns/iter (+/- 0) = 333 MB/s
mask_mult_bool_branchy_lookup_table  ... bench:    20 ns/iter (+/- 0) = 350 MB/s
mask_mult_bool_lookup_table          ... bench:    19 ns/iter (+/- 0) = 368 MB/s
mask_mult_bool_match_range           ... bench:    19 ns/iter (+/- 0) = 368 MB/s
mask_shifted_bool_match_range        ... bench:    19 ns/iter (+/- 0) = 368 MB/s
```
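For illustration, here is a minimal standalone sketch of the single-byte branchless idea in the spirit of the `mask_shifted_bool_match_range` variant (not the exact `libcore` code): a range check yields a `bool`, which is widened to a byte and shifted into bit 5 to form the `0x20` ASCII case mask, so nothing ever branches on the result.

```rust
/// Branchless ASCII case conversion for a single byte (illustrative sketch).
/// ASCII upper- and lowercase letters differ only in bit 5 (0x20).
fn to_ascii_uppercase(byte: u8) -> u8 {
    // The range check yields a `bool`; converting it to 0 or 1 and shifting
    // it into bit 5 builds the case mask instead of branching on the result.
    let is_lower = matches!(byte, b'a'..=b'z') as u8;
    byte & !(is_lower << 5) // clear bit 5 only for lowercase letters
}

fn to_ascii_lowercase(byte: u8) -> u8 {
    let is_upper = matches!(byte, b'A'..=b'Z') as u8;
    byte | (is_upper << 5) // set bit 5 only for uppercase letters
}

fn main() {
    assert_eq!(to_ascii_uppercase(b'a'), b'A');
    assert_eq!(to_ascii_uppercase(b'@'), b'@'); // non-letters are unchanged
    assert_eq!(to_ascii_lowercase(b'Z'), b'z');
}
```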
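And a sketch of the “fake SIMD” trick on `u32` (again illustrative, not the benchmark’s exact code; the function names are made up for this example). Per byte, an addition pushes bit 7 high when the byte clears a range bound; combining two such checks marks the uppercase letters, and shifting that mark down to `0x20` lowercases four bytes at once. As noted above, the additions can carry into the neighbouring byte when a byte is not ASCII, so this version is only correct for ASCII input.

```rust
/// Lowercase four ASCII bytes packed in a `u32` at once (SWAR / “fake SIMD”).
/// Only correct when every byte is ASCII (< 0x80): otherwise the additions
/// below can carry into the next byte, which is the bug mentioned above.
fn fake_simd_to_lowercase(word: u32) -> u32 {
    // For an ASCII byte b, bit 7 of (b + 0x3F) is set iff b >= b'A',
    // and bit 7 of (b + 0x25) is set iff b > b'Z'.
    let ge_a = word.wrapping_add(0x3f3f_3f3f);
    let gt_z = word.wrapping_add(0x2525_2525);
    // Bytes that are uppercase letters now have bit 7 set in `is_upper`.
    let is_upper = ge_a & !gt_z & 0x8080_8080;
    // Shift that bit down to 0x20 and OR it in to lowercase those bytes.
    word | (is_upper >> 2)
}

/// Apply the conversion over a byte slice via `align_to_mut`, falling back to
/// the one-byte-at-a-time conversion for the unaligned head and tail.
fn make_ascii_lowercase_fake_simd(bytes: &mut [u8]) {
    // Safety: reinterpreting u8s as u32s is fine for any bit pattern.
    let (before, aligned, after) = unsafe { bytes.align_to_mut::<u32>() };
    before.iter_mut().for_each(|b| b.make_ascii_lowercase());
    aligned.iter_mut().for_each(|w| *w = fake_simd_to_lowercase(*w));
    after.iter_mut().for_each(|b| b.make_ascii_lowercase());
}

fn main() {
    let mut buf = *b"Hello, WORLD! 123";
    make_ascii_lowercase_fake_simd(&mut buf);
    assert_eq!(&buf, b"hello, world! 123");
}
```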