-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add slice::sort_by_cached_key as a memoised sort_by_key #48639
Conversation
r? @bluss (rust_highfive has picked a reviewer for you, use r? to override) |
cc @stjepang |
The numbers for the case of a very expensive key function are promising, but can we get some numbers for trivial key functions as well to quantify how much slower those become? |
I'm not feeling very inspired when it comes to comparators, so I've added some simpler tests (that may or may not be semantically meaningful) — you can see there is a slight regression for extremely simple keys. If you have any suggestions for benchmarks, I'll add them above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR! The benchmark numbers look great. :)
The part after sorting that applies the permutation to the original slice looks a little suspicious. Does it work correctly and in linear time?
AFAIK, the best way of applying a permutation is by traversing each cycle in it, swapping elements in the original slice and updating indices in the permutation as you go from one index to the next. See this article for an example. Your code looks much simpler than that, which is why I'm asking. :)
Could you do a few more benchmarks? Let's make sure that this optimization doesn't regress in any cases. In particular, I'm interested in the following:
- Sort 50000 random integers using
v.sort_by_key(|x| -x)
- Sort 50000 random strings using
v.sort_by_key(|x| x.len())
Also, I think we should avoid the allocation when sorting short slices. Here's a suggestion:
// Play a little with benchmarks to find the best constant...
const ARRAY_LEN: usize = 16;
let mut array: [ManuallyDrop<(K, usize)>; ARRAY_LEN];
let mut vec: Vec<ManuallyDrop<(K, usize)>>;
let mut indices: &mut [ManuallyDrop<(K, usize)>];
if self.len() <= ARRAY_LEN {
unsafe {
array = mem::uninitialized();
for ((i, k), elem) in self.iter().map(f).enumerate().zip(array.iter_mut()) {
ptr::write(elem, ManuallyDrop::new((k, i)));
}
}
indices = &mut array[0 .. self.len()];
} else {
vec = self.iter().map(f).enumerate().map(|(i, k)| ManuallyDrop::new((k, i))).collect();
indices = &mut vec[..];
}
indices.sort_unstable();
// ... use `indices` to permute `self`
for elem in indices[0 .. self.len()] {
unsafe {
ManuallyDrop::drop(elem);
}
}
Finally, regarding sort_unstable_key
- unfortunately, we cannot use dynamic allocation with it, but we can still optimize it somewhat. Note that the partitioning phase picks a pivot and then compares every element in the slice with it. Currently, we're recomputing the key for the pivot on each comparison, but we could compute it just once before partitioning. That'd effectively halve the number of key computations.
src/liballoc/slice.rs
Outdated
index = indices[index].1; | ||
} | ||
self.swap(i, index); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the time complexity of this for
loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I was too concerned about correctness to notice that I might have slipped up with the time complexity! I think this is quadratic in the worst case, given some thought. I'll switch to the one you suggested — that seems like a safer choice in any respect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think I can easily modify my method to reduce it to linear. It also seems to have a lower constant factor overhead than the one in the article, so a win both ways :)
src/liballoc/slice.rs
Outdated
@@ -1315,8 +1314,7 @@ impl<T> [T] { | |||
/// It is designed to be very fast in cases where the slice is nearly sorted, or consists of | |||
/// two or more sorted sequences concatenated one after another. | |||
/// | |||
/// Also, it allocates temporary storage half the size of `self`, but for short slices a | |||
/// non-allocating insertion sort is used instead. | |||
/// The algorithm allocates temporary storage the size of `self`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's be a bit more precise here. Temporary storage is a Vec<(K, usize)>
of the same length as self
.
src/liballoc/slice.rs
Outdated
@@ -1328,10 +1326,21 @@ impl<T> [T] { | |||
/// ``` | |||
#[stable(feature = "slice_sort_by_key", since = "1.7.0")] | |||
#[inline] | |||
pub fn sort_by_key<B, F>(&mut self, mut f: F) | |||
pub fn sort_by_key<B, F>(&mut self, f: F) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: B
is an odd choice for the key type. :)
How about we change it to K
here and in sort_unstable_by_key
?
src/liballoc/slice.rs
Outdated
/// It is typically faster than stable sorting, except in a few special cases, e.g. when the | ||
/// slice consists of several concatenated sorted sequences. | ||
/// Due to its key calling strategy, `sort_unstable_by_key` is likely to be slower than | ||
/// `sort_by_key` in cases where the key function is expensive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to turn sort_unstable_by_key
and sort_by_key
into actual links to their documentation.
I would a priori assume that this change would not be in our plans.. such a decorate sort undecorated solution is quite different from the current sort by key, and could exist along side it. If this is in the absolute majority of cases a better way to do it, then it makes sense to switch. Making the sort indirect (on indices) is a loss for cache efficiency compared to the original sort and a cheap key function. I may not have the time to review this properly, but stjepang's review is of course more than enough. |
Let's not forget to run benchmarks on an ARM processor and make sure there are no significant regressions. Another idea for improving performance: use u32 rather than usize for indices whenever possible. @bluss The sort is not indirect. We're sorting pairs of values and indices, not bare indices. We lose some cache efficiency only by the fact that every element is one word larger. |
Yes, that's a good point. The sorting has the relatively cache local behavior of merge sort, but the apply permutation step doesn't. |
Can you try a benchmark for sorting by a https://doc.rust-lang.org/std/cmp/struct.Reverse.html key? If that regresses (the easiest "sort it descending" way of which I know), then I'm against this. It feels likely that people would know when their comparator is expensive, and thus could explicitly do it a different way. Maybe there's a way we can provide an "apply this permutation" helper to make doing the Schwartzian transform version easy? That can use |
Sorting a random
Results on my x86-64 laptop and ARMv7 Raspberry Pi:
While Unless we find a way to somehow alleviate this problem, I find it a deal breaker at the moment. |
After addressing some accidental quadraticity, here are some new benchmark results as requested by @stjepang and @scottmcm:
Bearing in mind that these keys are essentially worst cases for the new method (i.e. functions with very low overhead), the fact it now outperforms the original is very encouraging. However, @stjepang's suggestion to use |
My ARM machine is 32-bit so unfortunately using Note that, currently, the fastest and easiest way of sorting a slice in descending order is typically by sorting by By the way, another issue we haven't really explored is performance on pre-sorted (ascending/descending) and partially sorted (pre-sorted intervals or subsequences) inputs. Those cases mustn't regress too much either. |
@stjepang: Ah, okay — yeah, that's frustrating. I'll give it some thought. At worst, we could split the method off into two variants, but it'd be nice if we could optimise this one enough that it was never significantly worse than the original. |
@scottmcm @stjepang: just so I have a bit more context in trying to figure how to improve this case, what's the motivation for doing In addition, even switching to the unstable sort with this change would fix any regressions. So to be clear, the regression here seems to be: large performance-sensitive stable sorts with trivial keys on ARM (if I haven't missed something). To be clear, I'm not against adding a new method specifically for a "cached key" sort, but it would certainly be simpler for users if we could boost the performance of 90% of their sort-by-keys without any effort on their part :) |
Are you sure? Reversing the slice after sorting in ascending order is just extra work. And
Yes, pretty much. Actually, the regression probably appears on all machines which don't have highly parallel instruction pipelining (only modern x86-64 machines do, AFAIK).
Do you perhaps have any concrete examples of how you're using My guess is that most of the time the key extraction function is slow because it's just a |
The three methods are very close, but a representative benchmark (on my machine, at least):
I haven't really looked into why this is — perhaps
I can find more later, but one off the top of my head is:
(Additional work was done to avoid recomputing expensive functions that could have been avoided if the implementation of sort_by_key did not recompute the key). (There are other similar related examples in rustc too.)
I think this is one occurrence, but it is definitely not the prevalent (for example, looking through the Rust codebase gives other examples.) Regardless of which patterns necessitate non-trivial keys, the fact that they are frequently used indicates an improvement would be welcome. |
Hmm, that's odd. The
Thank you, having such examples is really helpful!
Agreed - we should do something about this. However, I'm afraid that the current state of the PR will make some people happy and others sad - there are improvements in some cases and regressions in others. I've often wondered if we should extend |
@stjepang No slice functionality in itertools -- having separate slicetools would be lovely. |
It seemed to have been a one-off — following up on that after spotting it, the variance was much lower, with around the same average.
I'm not sure then! It does seem slightly odd.
Yes, I think you're right. I think, intrinsically, it won't be possible to improve every case and in general, although the delta isn't large, I do expect that the majority of trivial key cases would regress a small amount if the current implementation is replaced. If we don't want this to happen, then perhaps a new method is the best way forward. In theory, we could have a clippy lint that suggests using the "key-cached" method for occurrences of Scanning through a GitHub search for |
78f2a06
to
bdcc6f9
Compare
I don't know which PR will get in first, but there's a perfect situation for using this in https://github.com/rust-lang/rust/pull/49196/files#diff-c16751a917081c702c37a8959be40f50R1440 |
@scottmcm: I was thinking of going through rustc, finding good candidates for the cached sort, as a follow-up 👍 |
Is this at a suitable point to start a FCP for the method? |
Given that the method is unstable I don't think an FCP is needed at this point. |
Ping from triage @bluss! The author pushed new commits, could you review them? |
src/liballoc/slice.rs
Outdated
/// | ||
/// For expensive key functions (e.g. functions that are not simple property accesses or | ||
/// basic operations), [`sort_by_cached_key`](#method.sort_by_cached_key) is likely to be | ||
/// significantly faster, as it does not recompute element keys. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good pointer to make but I think we are normally cautious and omit this; otherwise we have the docs for a stable method recommending an experimental method (for the next few releases).
This is how we handled sort_unstable_by
; the mention of it in sort_by
's doc first showed up in Rust 1.20 when it went stable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, that's a reasonable decision. I'll get rid of those comments for now.
I've removed the mention of |
@bors r+ rollup thanks! |
📌 Commit 9c7b69e has been approved by |
Add slice::sort_by_cached_key as a memoised sort_by_key At present, `slice::sort_by_key` calls its key function twice for each comparison that is made. When the key function is expensive (which can often be the case when `sort_by_key` is chosen over `sort_by`), this can lead to very suboptimal behaviour. To address this, I've introduced a new slice method, `sort_by_cached_key`, which has identical semantic behaviour to `sort_by_key`, except that it guarantees the key function will only be called once per element. Where there are `n` elements and the key function is `O(m)`: - `slice::sort_by_cached_key` time complexity is `O(m n log m n)`, compared to `slice::sort_by_key`'s `O(m n + n log n)`. - `slice::sort_by_cached_key` space complexity remains at `O(n + m)`. (Technically, it now reserves a slice of size `n`, whereas before it reserved a slice of size `n/2`.) `slice::sort_unstable_by_key` has not been given an analogue, as it is important that unstable sorts are in-place, which is not a property that is guaranteed here. However, this also means that `slice::sort_unstable_by_key` is likely to be slower than `slice::sort_by_cached_key` when the key function does not have negligible complexity. We might want to explore this trade-off further in the future. Benchmarks (for a vector of 100 `i32`s): ``` # Lexicographic: `|x| x.to_string()` test bench_sort_by_key ... bench: 112,638 ns/iter (+/- 19,563) test bench_sort_by_cached_key ... bench: 15,038 ns/iter (+/- 4,814) # Identity: `|x| *x` test bench_sort_by_key ... bench: 1,346 ns/iter (+/- 238) test bench_sort_by_cached_key ... bench: 1,839 ns/iter (+/- 765) # Power: `|x| x.pow(31)` test bench_sort_by_key ... bench: 3,624 ns/iter (+/- 738) test bench_sort_by_cached_key ... bench: 1,997 ns/iter (+/- 311) # Abs: `|x| x.abs()` test bench_sort_by_key ... bench: 1,546 ns/iter (+/- 174) test bench_sort_by_cached_key ... bench: 1,668 ns/iter (+/- 790) ``` (So it seems functions that are single operations do perform slightly worse with this method, but for pretty much any more complex key, you're better off with this optimisation.) I've definitely found myself using expensive keys in the past and wishing this optimisation was made (e.g. for rust-lang#47415). This feels like both desirable and expected behaviour, at the small cost of slightly more stack allocation and minute degradation in performance for extremely trivial keys. Resolves rust-lang#34447.
…n, r=scottmcm Use sort_by_cached_key where appropriate A follow-up to rust-lang#48639, converting various slice sorting calls to `sort_by_cached_key` when the key functions are more expensive.
…ey, r=SimonSapin Stabilize slice_sort_by_cached_key I was going to ask on the tracking issue (rust-lang#34447), but decided to just send this and hope for an FCP here. The method was added last March by rust-lang#48639. Signature: https://doc.rust-lang.org/std/primitive.slice.html#method.sort_by_cached_key ```rust impl [T] { pub fn sort_by_cached_key<K, F>(&mut self, f: F) where F: FnMut(&T) -> K, K: Ord; } ``` That's an identical signature to the existing `sort_by_key`, so I think the questions are just naming, implementation, and the usual "do we want this?". The implementation seems to have proven its use in rustc at least, which many uses: https://github.com/rust-lang/rust/search?l=Rust&q=sort_by_cached_key (I'm asking because it's exactly what I just needed the other day: ```rust all_positions.sort_by_cached_key(|&n| data::CITIES.iter() .map(|x| *metric_closure.get_edge(n, x.pos).unwrap()) .sum::<usize>() ); ``` since caching that key is a pretty obviously good idea.) Closes rust-lang#34447
At present,
slice::sort_by_key
calls its key function twice for each comparison that is made. When the key function is expensive (which can often be the case whensort_by_key
is chosen oversort_by
), this can lead to very suboptimal behaviour.To address this, I've introduced a new slice method,
sort_by_cached_key
, which has identical semantic behaviour tosort_by_key
, except that it guarantees the key function will only be called once per element.Where there are
n
elements and the key function isO(m)
:slice::sort_by_cached_key
time complexity isO(m n log m n)
, compared toslice::sort_by_key
'sO(m n + n log n)
.slice::sort_by_cached_key
space complexity remains atO(n + m)
. (Technically, it now reserves a slice of sizen
, whereas before it reserved a slice of sizen/2
.)slice::sort_unstable_by_key
has not been given an analogue, as it is important that unstable sorts are in-place, which is not a property that is guaranteed here. However, this also means thatslice::sort_unstable_by_key
is likely to be slower thanslice::sort_by_cached_key
when the key function does not have negligible complexity. We might want to explore this trade-off further in the future.Benchmarks (for a vector of 100
i32
s):(So it seems functions that are single operations do perform slightly worse with this method, but for pretty much any more complex key, you're better off with this optimisation.)
I've definitely found myself using expensive keys in the past and wishing this optimisation was made (e.g. for #47415). This feels like both desirable and expected behaviour, at the small cost of slightly more stack allocation and minute degradation in performance for extremely trivial keys.
Resolves #34447.