Improve performance of unstable sort #104116

Voultapher · 2022-11-07T17:24:12Z

This is a followup to #100856, this time speeding up slice::unstable_sort. Fundamentally it uses optimal sorting networks to speedup sorting small slices. Before going into too much detail on the speedup, the most important thing is correctness. It passes my test suite in https://github.com/Voultapher/sort-research-rs both with normal Rust and Miri. That includes tests not found or not found with the same rigor in the standard library tests, observable_is_less, panic_retain_original_set and violate_ord_retain_original_set. And from a code structure point, it copies several elements from slice::sort. They live in separate modules, and I don't know enough about the structure of the standard library to unify them. In essence I could image them both living in core and stable sorting requiring a passed in function that does the allocation. But even then I'm not sure how that affects LTO and inlining, which are critical for performance. In my repository I have all the implementations copied into individual modules, and I'm not sure how to test that in the standard library.

Speedups:

1.2-1.5x faster than existing implementation for random, saw and other patterns (integers)
Same performance for very densely distributed random inputs and ascending and descending inputs, len >= 40.
On average fewer comparisons for types that are judged cheap to move
Near perfect sorting of descending inputs starting from size 8.

To understand this PR, it's advisable to read #100856. There I go into more detail on the test methodology and graphs.

The full benchmark results can be found here: https://voultapher.github.io/sort-research-rs/results/e746538/

Here are the speedups and slowdowns for Zen3:

For hot-u64 we can see that the speedups are the most extreme for smaller sizes and level out at 30%.

For hot-string which is relatively expensive to access we see it level out at 4% while seeing 10-15% speedup for smaller sizes, and 3x speedup as the most extreme case hot-string-descending-20. These results are in-line with the average reduced comparison counts and the extreme speedups are explained by the ability to detect fully or mostly decreasing inputs even for small inputs. For smaller sizes there are also a noticeable amount of slowdowns. I'd argue that the overall speedup is worth it here, but this can be tuned with the qualifies_for_branchless_sort heuristic.

The 1k type generally produces the most noisy results, so I'm not sure this is a real signal. But it might be thanks to the tuned insert right and left functions.

The cheap to access but expensive to compare type f128 shows a modest speedup of 10-20%, with descending inputs again showing the largest outliers at a 5x speedup for descending-20, which switches from insertion sort to the more sophisticated sort_small.

The cold results don't look too hot for small sizes <= 20, these could be addressed by adding extra logic. But even then it's a tricky balance, where something like pattern analysis is really hard if not impossible to do without some slowdown for cold code. Also note, I'm not too sure about the my methodology for cold code here.

The speedups and slowdowns for Firestorm are relatively similar:

Comparison statistics.

Percent comparisons done more by rust_new_unstable than rust_std_unstable
[i32-ascending-20-plus]:              mean: -2%     min: -13%  max: 0%
[i32-ascending-20-sub]:               mean: 14%     min: 0%    max: 67%
[i32-ascending_saw_20-20-plus]:       mean: -7%     min: -19%  max: -2%
[i32-ascending_saw_20-20-sub]:        mean: 14%     min: 0%    max: 67%
[i32-ascending_saw_5-20-plus]:        mean: -8%     min: -24%  max: -3%
[i32-ascending_saw_5-20-sub]:         mean: 13%     min: -11%  max: 67%
[i32-descending-20-plus]:             mean: -47%    min: -240% max: 0%
[i32-descending-20-sub]:              mean: -396%   min: -900% max: 0%
[i32-descending_saw_20-20-plus]:      mean: -54%    min: -240% max: -3%
[i32-descending_saw_20-20-sub]:       mean: -396%   min: -900% max: 0%
[i32-descending_saw_5-20-plus]:       mean: -6%     min: -22%  max: -1%
[i32-descending_saw_5-20-sub]:        mean: -74%    min: -350% max: 0%
[i32-pipe_organ-20-plus]:             mean: -27%    min: -64%  max: -6%
[i32-pipe_organ-20-sub]:              mean: -54%    min: -143% max: 18%
[i32-random-20-plus]:                 mean: -6%     min: -14%  max: -3%
[i32-random-20-sub]:                  mean: -8%     min: -34%  max: 25%
[i32-random_binary-20-plus]:          mean: 13%     min: -13%  max: 85%
[i32-random_binary-20-sub]:           mean: 22%     min: 0%    max: 62%
[i32-random_dense-20-plus]:           mean: 7%      min: 0%    max: 35%
[i32-random_dense-20-sub]:            mean: 4%      min: -15%  max: 31%

With some exceptions, overall we see a nice average reduction of comparisons. That's the reasons why I tuned qualifies_for_branchless_sort to allow all types except those that are deemed expensive to move.

Binary bloat.

Compiling a simple hello world and sorting 6 different integer types and 2 different String containers, in release mode and stripping the binary yields:

Baseline:            300k
rust_std_unstable:   379k
rust_new_unstable:   425k

Why is this faster?

Moder hardware is so complex that all I can do is guess, but in essence if you look at the assembly call graph of insertion_sort_shift_left:

mbly

And compare that to sort8_optimal:

It might seem clear, why the sort8_optimal and it's siblings are able to extract more Instruction-Level-Parallelism (ILP) than insertion sort which is currently being used.

Another reason is, that pdqsort spends a lot of time in insertion sort:

So it's worthwhile to speed up this area. The kind of optimizations pdqsort does to avoid choosing a bad pivot or even switching to heapsort make a lot of sense at larger sizes, but are overkill for smaller sizes. Which are better addressed with something tailormade like sort_small.

At it's core it leverages branchless compare and swap operations employed in optimal sorting networks. A variety of strategies is used to optimize for hot and cold runtime, binary size, and max comparisons done. Many patterns see a reduction in average comparisons performed. So this improvement is applied to all types that are deemed cheap to move. It copies parts of the stable sort. Before this get's merged they ought to be unified.

rustbot · 2022-11-07T17:24:20Z

r? @m-ou-se

(rustbot has picked a reviewer for you, use r? to override)

rustbot · 2022-11-07T17:24:23Z

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs then please comment with @rustbot label +T-libs-api -T-libs to tag it appropriately. If this PR contains changes to any unstable APIs please edit the PR description to add a link to the relevant API Change Proposal or create one if you haven't already. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary.

Examples of T-libs-api changes:

Stabilizing library features
Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
Changing public documentation in ways that create new stability guarantees
Changing observable runtime behavior of library APIs

rust-log-analyzer · 2022-11-07T18:44:27Z

The job mingw-check failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)

configure: rust.debug-assertions := True
configure: rust.overflow-checks := True
configure: llvm.assertions      := True
configure: dist.missing-tools   := True
configure: build.configure-args := ['--enable-sccache', '--disable-manage-submodu ...
configure: writing `config.toml` in current directory
configure: 
configure: run `python /checkout/x.py --help`
Attempting with retry: make prepare
---
tidy: Skipping binary file check, read-only filesystem
Checking which error codes lack tests...
* 632 error codes
* highest error code: E0790
tidy error: /checkout/library/core/src/slice/sort.rs:773: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:880: TODO is deprecated; use FIXME
tidy error: /checkout/library/core/src/slice/sort.rs:890: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:957: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1056: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1080: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1103: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1116: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1127: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1140: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1155: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1160: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1174: undocumented unsafe
tidy error: /checkout/library/core/src/slice/sort.rs:1492: TODO is deprecated; use FIXME
Found 0 error(s) in error codes
Done!
* 389 features
some tidy checks failed

the8472 · 2022-11-07T19:31:49Z

They live in separate modules, and I don't know enough about the structure of the standard library to unify them.

If this is about the core/alloc split then you can expose them as an unstable, hidden feature without tracking issue and use that feature in the other crate.

But even then I'm not sure how that affects LTO and inlining

Generic methods are always eligible for inlining because they are only instantiated once the concrete type is known.

Fundamentally it uses optimal sorting networks to speedup sorting small slices. Before going into too much detail on the speedup

Looking at the linked page it looks like sort networks optimize for short dependency chains, which should be good for ILP. But is there room for optimizing for cache effects too? I guess that's more of a research question than something that can be done in a PR.

Voultapher · 2022-11-08T20:06:31Z

FYI the CI fails because of a TODO comment in the code. That TODO is a question for the authors here, it would be relatively easy to re-use small_sort for slice::select_nth_unstable but I'm not too familiar with the code and the implications that would have and I don't have benchmarks for that. In addition I'd like to wait with this PR until the other one was reviewed and hopefully merged.

…homcc Unify stable and unstable sort implementations in same core module This moves the stable sort implementation to the core::slice::sort module. By virtue of being in core it can't access `Vec`. The two `Vec` used by merge sort, `buf` and `runs`, are modelled as custom types that implement the very limited required `Vec` interface with the help of provided allocation and free functions. This is done to allow future re-use of functions and logic between stable and unstable sort. Such as `insert_head`. This is in preparation of rust-lang#100856 and rust-lang#104116. It only moves code, it *doesn't* change any of the sort related logic. This unlocks the ability to share `insert_head`, `insert_tail`, `swap_if_less` `merge` and more. Tagging ``@Mark-Simulacrum`` I hope this allows progress on rust-lang#100856, by moving `merge_sort` here I hope future changes will be easier to review.

…homcc Unify stable and unstable sort implementations in same core module This moves the stable sort implementation to the core::slice::sort module. By virtue of being in core it can't access `Vec`. The two `Vec` used by merge sort, `buf` and `runs`, are modelled as custom types that implement the very limited required `Vec` interface with the help of provided allocation and free functions. This is done to allow future re-use of functions and logic between stable and unstable sort. Such as `insert_head`. This is in preparation of rust-lang#100856 and rust-lang#104116. It only moves code, it *doesn't* change any of the sort related logic. This unlocks the ability to share `insert_head`, `insert_tail`, `swap_if_less` `merge` and more. Tagging ```@Mark-Simulacrum``` I hope this allows progress on rust-lang#100856, by moving `merge_sort` here I hope future changes will be easier to review.

…homcc Unify stable and unstable sort implementations in same core module This moves the stable sort implementation to the core::slice::sort module. By virtue of being in core it can't access `Vec`. The two `Vec` used by merge sort, `buf` and `runs`, are modelled as custom types that implement the very limited required `Vec` interface with the help of provided allocation and free functions. This is done to allow future re-use of functions and logic between stable and unstable sort. Such as `insert_head`. This is in preparation of rust-lang#100856 and rust-lang#104116. It only moves code, it *doesn't* change any of the sort related logic. This unlocks the ability to share `insert_head`, `insert_tail`, `swap_if_less` `merge` and more. Tagging ````@Mark-Simulacrum```` I hope this allows progress on rust-lang#100856, by moving `merge_sort` here I hope future changes will be easier to review.

Voultapher · 2023-01-26T07:56:56Z

Closing for now, work will be resumed in other PRs. Most of the information in here is obsolete now, and I've gotten a lot further in optimising.

rustbot assigned m-ou-se Nov 7, 2022

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Nov 7, 2022

Voultapher mentioned this pull request Nov 21, 2022

Unify stable and unstable sort implementations in same core module #104672

Merged

scottmcm mentioned this pull request Jan 17, 2023

Add heapsort fallback in select_nth_unstable #106997

Merged

onestacked mentioned this pull request Jan 21, 2023

[WIP] Constified slice::sort_unstable, sort_internals #102279

Closed

7 tasks

Voultapher closed this Jan 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance of unstable sort #104116

Improve performance of unstable sort #104116

Voultapher commented Nov 7, 2022

rustbot commented Nov 7, 2022

rustbot commented Nov 7, 2022

rust-log-analyzer commented Nov 7, 2022

the8472 commented Nov 7, 2022

Voultapher commented Nov 8, 2022

Voultapher commented Jan 26, 2023

Improve performance of unstable sort #104116

Improve performance of unstable sort #104116

Conversation

Voultapher commented Nov 7, 2022

rustbot commented Nov 7, 2022

rustbot commented Nov 7, 2022

rust-log-analyzer commented Nov 7, 2022

the8472 commented Nov 7, 2022

Voultapher commented Nov 8, 2022

Voultapher commented Jan 26, 2023