Add helper method for taking the k smallest elements in an iterator #473
I started adding another test, but I'll keep figuring out how to generate arbitrarish iterators for another day.
.into_sorted_vec()
.into_iter()
Would be cool to use `into_iter_sorted`, too bad that's unstable.
Well that sounds like something I can push for: rust-lang/rust#76234
/// itertools::assert_equal(five_smallest, 0..5);
/// ```
#[cfg(feature = "use_std")]
fn k_smallest(self, k: usize) -> VecIntoIter<Self::Item>
Not sure what itertools' policy is, but maybe if the return type was newtype'd as `KSmallestIter` or something, it would allow changing it in the future to something more efficient?
That's a good point, I'll address that tomorrow <3
I forgot to mention, but that doesn't seem feasible while keeping the guarantee that `.collect::<Vec<_>>` is “free” (in-place, O(1) time). The stdlib's vec iterator can do it, but does so using specialisation (which isn't stabilised yet).
/// less than k elements, the result is equivalent to `self.sorted()`.
///
/// This is guaranteed to use `k * sizeof(Self::Item) + O(1)` memory
/// and `O(n log k)` time, with `n` the number of elements in the input.
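For readers following the complexity discussion, here is a minimal sketch of the bounded max-heap idea that the doc comment above describes. It is not the PR's actual code: `k_smallest_sketch` is a hypothetical free function, and only stable `std::collections::BinaryHeap` calls are used.

```rust
use std::collections::BinaryHeap;

/// Hypothetical sketch: returns the `k` smallest items in ascending order,
/// keeping at most `k` items in the heap at any time.
fn k_smallest_sketch<T: Ord, I: IntoIterator<Item = T>>(iter: I, k: usize) -> Vec<T> {
    if k == 0 {
        return Vec::new();
    }
    let mut iter = iter.into_iter();
    // Max-heap seeded with the first `k` items (fewer if the input is short).
    let mut heap: BinaryHeap<T> = iter.by_ref().take(k).collect();
    for item in iter {
        // `peek_mut` exposes the largest of the current candidates.
        if let Some(mut top) = heap.peek_mut() {
            if item < *top {
                // Overwrite the maximum; the heap is repaired when `top` drops.
                *top = item;
            }
        }
    }
    // O(k log k) final sort of the survivors.
    heap.into_sorted_vec()
}
```

Each of the `n` input items costs at most one `O(log k)` heap repair, which is where the `O(n log k)` bound comes from. For example, `k_smallest_sketch(vec![5, 1, 4, 2, 3], 2)` yields `[1, 2]`.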
FWIW, this is not what I was expecting this to be. Instead, I was expecting a wrapper around `partition_at_index` so that it'd be `O(n)` space & time.
What's implemented is interesting, though -- the lower memory use and `O(n log k)` is an interesting alternative to the `O(n + k log n)` of `BinaryHeap::from_iter(it).into_iter_sorted().take(k)` -- but maybe take some time to figure out a way to communicate these differences in the name.
@scottmcm `O(n)` time is impossible in the general case. Otherwise, `it.k_smallest(n)` would sort a sequence of length `n` in linear time.
Re: `BinaryHeap::from_iter(it).into_iter_sorted().take(k)`, wouldn't this have strictly worse performance characteristics, regarding both time and space? I need to think some more about it once caffeinated.
It's absolutely possible if this only needs to find the k smallest elements and not also sort those elements -- the fact that this is also sorting those elements wasn't obvious from the title. (And I would say, the extra `k log k` of sorting them is unfortunate if someone just wants the smallest k and doesn't need them sorted. It would be nice to be able to turn them back into a `vec::IntoIter` for the people who would be fine with heap order, like `.k_smallest(10).sum()`.)
So there are plenty of options here:
Algorithm | space | time |
---|---|---|
nth element (unsorted) | n | n |
partial sort | n | n + k log k |
full heap | n | n + k log n |
k-size heap | k | n log k |
(Which does emphasize that `BinaryHeap::into_iter_sorted` should only be used for not-known-ahead-of-time `k`.)
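To make the "full heap" row of the table concrete, here is a hedged sketch on stable Rust. Since `into_iter_sorted` is unstable, it pops from a min-heap built with `std::cmp::Reverse` instead; `k_smallest_full_heap` is a hypothetical name used for illustration only.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical sketch of the "full heap" strategy: heapify all `n` items,
/// then pop the `k` smallest. Uses O(n) space.
fn k_smallest_full_heap<T: Ord>(iter: impl IntoIterator<Item = T>, k: usize) -> Vec<T> {
    // `Reverse` flips the ordering so the max-heap behaves as a min-heap.
    let items: Vec<Reverse<T>> = iter.into_iter().map(Reverse).collect();
    // Building a heap from a `Vec` is a linear-time heapify.
    let mut heap = BinaryHeap::from(items);
    let mut out = Vec::with_capacity(k.min(heap.len()));
    while out.len() < k {
        match heap.pop() {
            // Each pop costs O(log n), so k pops cost O(k log n).
            Some(Reverse(x)) => out.push(x),
            None => break, // input had fewer than k items
        }
    }
    out
}
```

Unlike the k-size heap, this variant does not need to know `k` before consuming the input, which is the trade-off the comment above points at.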
> It's absolutely possible if this only needs to find the k smallest elements and not also sort those

Agreed; that just seemed like a worse API to be exposing, as:
- an undefined output order, which happens to be often sorted or almost-sorted (like heap order), can be confusing for the user (I myself thought for a moment that `BinaryHeap::into_vec()` sorted the result, and was surprised by that, as the examples I was testing with happened to produce a sorted result);
- sorting the end result doesn't change the asymptotic complexity, for any heap-based implementation.
However, if we expose something like a fixed-size heap (and I agree it's a good idea, if only for the `.extend()` type of use-cases you suggested), users who do not require a sorted result could use that directly.
> So there are plenty of options here:

You seem to be including partition-based algorithms (AFAIK, the only way to get better time complexity than `O(n log k)`), and those do not work directly for iterators; in principle, it's always possible to first collect everything into a `Vec`, use a partition-based algorithm, then truncate the `Vec` down to size `k`, but in practice I would expect this to be much slower.
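The collect, partition, truncate approach described above can be sketched as follows. This is an illustration, not something proposed in the PR: `k_smallest_unsorted` is a hypothetical name, and the sketch assumes a toolchain where `slice::select_nth_unstable` is available (to my knowledge, the stabilised successor of the `partition_at_index` mentioned earlier in the thread).

```rust
/// Hypothetical sketch: the `k` smallest items in unspecified order,
/// using O(n) space and expected O(n) time via selection.
fn k_smallest_unsorted<T: Ord>(iter: impl IntoIterator<Item = T>, k: usize) -> Vec<T> {
    // Buffer the whole input; selection needs random access.
    let mut v: Vec<T> = iter.into_iter().collect();
    if k < v.len() {
        if k > 0 {
            // Afterwards v[..k] holds the k smallest items, in no particular order.
            v.select_nth_unstable(k - 1);
        }
        v.truncate(k);
    }
    v
}
```

The result is deliberately left unsorted; a caller who wants ascending order can sort the `k` survivors afterwards, which corresponds to the "partial sort" row of the table.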
I think we should be careful if we really want sorted output, as this forces users to pay runtime for sorting - even if they don't need it.
If API discoverability is an issue, we could still have `smallest_k_sorted` vs. `smallest_k_unsorted` as a last resort, clarifying the difference.
I was thinking about this some more, and since the logic in here is so cool, I was wondering if it would make sense to expose it in more ways. Notably, it seems like if there were a type for the logic, it could implement Dunno if that's something that'd be appropriate for
@scottmcm Do you think I should move making a fixed-size heap wrapper to a separate PR, and make this one depend on it?
@nbraud I don't know -- there's no clear precedent for exposing non-iterator types in itertools. So maybe it's best to just leave it as you have it until an itertools maintainer comments.
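As a rough illustration of what a standalone fixed-size heap type might look like, here is a sketch that supports the `.extend()` style of use mentioned earlier. `KSmallestHeap` and its methods are hypothetical names invented for this example; nothing like it exists in itertools as far as this thread shows.

```rust
use std::collections::BinaryHeap;

/// Hypothetical fixed-size heap that retains only the `k` smallest items pushed.
struct KSmallestHeap<T: Ord> {
    k: usize,
    heap: BinaryHeap<T>, // max-heap: the root is the largest retained item
}

impl<T: Ord> KSmallestHeap<T> {
    fn new(k: usize) -> Self {
        Self { k, heap: BinaryHeap::new() }
    }

    fn push(&mut self, item: T) {
        if self.heap.len() < self.k {
            self.heap.push(item);
        } else if let Some(mut top) = self.heap.peek_mut() {
            if item < *top {
                *top = item; // replace the current maximum
            }
        }
    }

    /// Heap order, no extra sorting cost.
    fn into_unsorted_vec(self) -> Vec<T> {
        self.heap.into_vec()
    }

    /// Ascending order, paying the extra O(k log k).
    fn into_sorted_vec(self) -> Vec<T> {
        self.heap.into_sorted_vec()
    }
}

impl<T: Ord> Extend<T> for KSmallestHeap<T> {
    fn extend<I: IntoIterator<Item = T>>(&mut self, iter: I) {
        for item in iter {
            self.push(item);
        }
    }
}
```

With a type like this, a caller who only needs an order-independent result such as a sum could take `into_unsorted_vec()` and skip the final sort, while the sorted variant stays available for everyone else.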
Consistency with `Iterator::take` is the rationale for returning less than `k` elements when the input is too short.
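A small sketch of the `Iterator::take` behaviour this commit message refers to: asking for more items than exist is not an error, the result is simply shorter. `k_smallest_by_sorting` is a hypothetical reference implementation used only to show the semantics.

```rust
/// Hypothetical reference implementation: sort everything, keep at most `k`.
fn k_smallest_by_sorting<T: Ord>(iter: impl IntoIterator<Item = T>, k: usize) -> Vec<T> {
    let mut all: Vec<T> = iter.into_iter().collect();
    all.sort();
    // `take` silently stops early when fewer than `k` items are available.
    all.into_iter().take(k).collect()
}
```

Calling it with `k = 5` on a three-element input returns all three elements in sorted order, which matches the documented "equivalent to `self.sorted()`" behaviour for short inputs.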
Instantiate the test over all unsigned integer types
By limiting k and m to be u16, we use at most 2¹⁶ sizeof(u64) = 512 kiB for each allocated array and heap (at most 1MiB total for k_smallest_range)
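The memory bound in this commit message can be checked with a few constants. The names below are made up for the example, and the factor of two assumes one array plus one heap, as the message describes.

```rust
/// Upper bound on element count when lengths are drawn from `u16`.
const MAX_LEN: usize = 1 << 16;
/// Bytes needed for one buffer of `u64` at that length.
const BYTES_PER_BUFFER: usize = MAX_LEN * std::mem::size_of::<u64>();
/// One array plus one heap, per the commit message.
const TOTAL_BYTES: usize = 2 * BYTES_PER_BUFFER;
```

That works out to 2¹⁶ × 8 = 524,288 bytes (512 KiB) per buffer and 1 MiB in total.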
See rust-lang/rust#75974 Co-authored-by: nicoo <[email protected]>
Ping? I addressed all feedback a week ago (but I'm not sure GitHub notifies on new pushes)
Thanks for this! Sorry for the delay.
bors r+
Build succeeded.