-
Notifications
You must be signed in to change notification settings - Fork 315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add helper method for taking the k smallest elements in an iterator #473
Changes from all commits
46a739f
51670c9
e71977e
11b54cc
b0443e0
d292fa0
7aadb1e
036a451
900f1e8
f31b617
45e5a64
11b2058
f28ffd0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
use alloc::collections::BinaryHeap; | ||
use core::cmp::Ord; | ||
|
||
pub(crate) fn k_smallest<T: Ord, I: Iterator<Item = T>>(mut iter: I, k: usize) -> BinaryHeap<T> { | ||
if k == 0 { return BinaryHeap::new(); } | ||
|
||
let mut heap = iter.by_ref().take(k).collect::<BinaryHeap<_>>(); | ||
|
||
for i in iter { | ||
debug_assert_eq!(heap.len(), k); | ||
// Equivalent to heap.push(min(i, heap.pop())) but more efficient. | ||
// This should be done with a single `.peek_mut().unwrap()` but | ||
// `PeekMut` sifts-down unconditionally on Rust 1.46.0 and prior. | ||
if *heap.peek().unwrap() > i { | ||
*heap.peek_mut().unwrap() = i; | ||
} | ||
} | ||
|
||
heap | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -194,6 +194,8 @@ mod group_map; | |
mod groupbylazy; | ||
mod intersperse; | ||
#[cfg(feature = "use_alloc")] | ||
mod k_smallest; | ||
#[cfg(feature = "use_alloc")] | ||
mod kmerge_impl; | ||
#[cfg(feature = "use_alloc")] | ||
mod lazy_buffer; | ||
|
@@ -2284,6 +2286,43 @@ pub trait Itertools : Iterator { | |
v.into_iter() | ||
} | ||
|
||
/// Sort the k smallest elements into a new iterator, in ascending order. | ||
/// | ||
/// **Note:** This consumes the entire iterator, and returns the result | ||
/// as a new iterator that owns its elements. If the input contains | ||
/// less than k elements, the result is equivalent to `self.sorted()`. | ||
/// | ||
/// This is guaranteed to use `k * sizeof(Self::Item) + O(1)` memory | ||
/// and `O(n log k)` time, with `n` the number of elements in the input. | ||
/// | ||
/// The sorted iterator, if directly collected to a `Vec`, is converted | ||
/// without any extra copying or allocation cost. | ||
/// | ||
/// **Note:** This is functionally-equivalent to `self.sorted().take(k)` | ||
/// but much more efficient. | ||
/// | ||
/// ``` | ||
/// use itertools::Itertools; | ||
/// | ||
/// // A random permutation of 0..15 | ||
/// let numbers = vec![6, 9, 1, 14, 0, 4, 8, 7, 11, 2, 10, 3, 13, 12, 5]; | ||
/// | ||
/// let five_smallest = numbers | ||
/// .into_iter() | ||
/// .k_smallest(5); | ||
/// | ||
/// itertools::assert_equal(five_smallest, 0..5); | ||
/// ``` | ||
#[cfg(feature = "use_alloc")] | ||
fn k_smallest(self, k: usize) -> VecIntoIter<Self::Item> | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure what's There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's a good point, I'll address that tomorrow <3 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I forgot to mention, but that doesn't seem feasible while keeping the guarantee that |
||
where Self: Sized, | ||
Self::Item: Ord | ||
{ | ||
crate::k_smallest::k_smallest(self, k) | ||
.into_sorted_vec() | ||
.into_iter() | ||
Comment on lines
+2322
to
+2323
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would be cool to use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Well that sounds like something I can push for: rust-lang/rust#76234 |
||
} | ||
|
||
/// Collect all iterator elements into one of two | ||
/// partitions. Unlike `Iterator::partition`, each partition may | ||
/// have a distinct type. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, this is not what I was expecting this to be. Instead, I was expecting a wrapper around
partition_at_index
so that it'd beO(n)
space & time.What's implemented is interesting, though -- the lower memory use and
O(n log k)
is an interesting alternative to theO(n + k log n)
ofBinaryHeap::from_iter(it).into_sorted_iter().take(k)
-- but maybe take some time to figure out a way to communicate these differences in the name.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scottmcm
O(n)
time is impossible in the general case. Otherwise,it.k_smallest(n)
would sort a sequence of lengthn
in linear time.Re:
BinaryHeap::from_iter(it).into_sorted_iter().take(k)
, wouldn't this have strictly-worse performance characteristics, regarding both time and space? I need to think some more about it once caffeinated.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's absolutely possible if this only needs to find the k smallest elements and not also sort those elements -- the fact that this is also sorting those elements wasn't obvious from the title. (And I would say, the extra
k log k
of sorting them is unfortunate if someone just wants the smallest k and doesn't need them sorted. It would be nice to be able to turn them back into avec::IntoIter
for the people who would be fine with heap order, like.k_smallest(10).sum()
.)So there are plenty of options here:
(Which does emphasize that
BinaryHeap::into_iter_sorted
should only be used for not-known-ahead-of-timek
.)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed; that just seemed like a worse API to be exposing, as:
BinaryHeap::into_vec()
sorted the result, and was surprised by that, as the examples I was testing with happened to produce a sorted result) ;However, if we expose something like a fixed-size heap — and I agree it's a good idea, if only for the
.extend()
type of use-cases you suggested — users who do not require a sorted result could use that directly.You seem to be including partition-based algorithms — AFAIK, the only way to get better time complexity than O(n log k) — and those do not work directly for iterators; in principle, it's always possible to first collect everything into a
Vec
, use a partition-based algorithm, then truncate theVec
down to sizek
, but in practice I would expect this to be much slower.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be careful if we really want sorted output, as this forces users to pay runtime for sorting - even if they don't need it.
If API discoverability is an issue, we could still have
smallest_k_sorted
vs.smallest_k_unsorted
as a last resort, clarifying the difference.