-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement a specialized version std::iter::Enumerate for TrustedLen #77822
Conversation
r? @shepmaster (rust_highfive has picked a reviewer for you, use r? to override) |
Benchmark for the function from the issue with the same implementation as this PR outside of It's around 17.6% faster on this function.
#![feature(test)]
#![feature(specialization)]
#![feature(trusted_len)]
#![feature(core_intrinsics)]
#![allow(soft_unstable)]
extern crate test;
pub fn test_1(x: &[u32], y: &[u32]) -> u32 {
let mut sum = 0;
let chunk_size = y.len();
for (c, y) in Enumerate::new(y.iter()) {
for chunk in x.chunks_exact(chunk_size) {
sum += chunk[c] + y;
}
}
sum
}
pub fn test_0(x: &[u32], y: &[u32]) -> u32 {
let mut sum = 0;
let chunk_size = y.len();
for (c, y) in y.iter().enumerate() {
for chunk in x.chunks_exact(chunk_size) {
sum += chunk[c] + y;
}
}
sum
}
struct Enumerate<I> {
iter: I,
count: usize,
len: usize,
}
impl<I: Iterator> Enumerate<I> {
#[inline]
fn new(iter: I) -> Self {
EnumerateImpl::new(iter)
}
}
impl<I> Iterator for Enumerate<I>
where
I: Iterator,
{
type Item = (usize, <I as Iterator>::Item);
#[inline]
fn next(&mut self) -> Option<(usize, <I as Iterator>::Item)> {
EnumerateImpl::next(self)
}
}
// Enumerate specialization trait
#[doc(hidden)]
trait EnumerateImpl<I> {
type Item;
fn new(iter: I) -> Self;
fn next(&mut self) -> Option<(usize, Self::Item)>;
}
impl<I> EnumerateImpl<I> for Enumerate<I>
where
I: Iterator,
{
type Item = I::Item;
#[inline]
default fn new(iter: I) -> Self {
Enumerate {
iter,
count: 0,
len: 0, // unused
}
}
#[inline]
default fn next(&mut self) -> Option<(usize, I::Item)> {
let a = self.iter.next()?;
let i = self.count;
// Possible undefined overflow.
self.count += 1;
Some((i, a))
}
}
impl<I> EnumerateImpl<I> for Enumerate<I>
where
I: std::iter::TrustedLen + ExactSizeIterator + Iterator,
{
#[inline]
fn new(iter: I) -> Self {
let len = iter.size_hint().0;
Enumerate {
iter,
count: 0,
len,
}
}
#[inline]
fn next(&mut self) -> Option<(usize, I::Item)> {
let a = self.iter.next()?;
unsafe { std::intrinsics::assume(self.count < self.len); }
let i = self.count;
// Possible undefined overflow.
self.count += 1;
Some((i, a))
}
}
#[cfg(test)]
mod tests {
use super::*;
use test::Bencher;
#[bench]
fn bench_test_0(b: &mut Bencher) {
let x = vec![0; 4 * 1024 * 1024];
let y = vec![0; 4 * 1024 * 1024];
b.iter(|| {
test::black_box(test_0(test::black_box(&x), test::black_box(&y)));
});
}
#[bench]
fn bench_test_1(b: &mut Bencher) {
let x = vec![0; 4 * 1024 * 1024];
let y = vec![0; 4 * 1024 * 1024];
b.iter(|| {
test::black_box(test_1(test::black_box(&x), test::black_box(&y)));
});
}
} |
54a6861
to
49f8381
Compare
5f1ebd5
to
cb9db68
Compare
cc @rust-lang/libs |
@pietroalbini Do we have a benchmark tool for such changes? |
We can see if this affects rustc's speed with perf, but otherwise I'm not aware of other benchmarking tooling. |
Is there anything I can do to help moving this forward? |
r? @m-ou-se |
☔ The latest upstream changes (presumably #79319) made this pull request unmergeable. Please resolve the merge conflicts. Note that reviewers usually do not review pull requests until merge conflicts are resolved! Once you resolve the conflicts, you should change the labels applied by bors to indicate that your PR is ready for review. Post this as a comment to change the labels:
|
27d8cee
to
fb23034
Compare
Should be all good again. |
fb23034
to
14c07d8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your PR, and sorry for the delay.
This change adds quite a lot of non-trivial code to fix a seemingly small optimization problem. In addition, it makes the Enumerate
struct bigger with a field that's often unused, which can be noticable in some situations.
I agree #75935 is a problem that should be fixed, but I'm a bit uncomfortable with a solution/workaround like this.
Do similar problems come up in other languages that use LLVM, and how is it handled there? Does LLVM maybe already have an open issue for a similar situation in e.g. C++ (where pointer pairs are also the common representation of ranges)?
// SAFETY: the caller must uphold the contract for | ||
// `Iterator::__iterator_get_unchecked`. | ||
let value = unsafe { try_get_unchecked(&mut self.iter, idx) }; | ||
let idx = Add::add(self.count, idx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this use Add::add
instead of +
? If this is necessary, a comment explaining why would be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's how it was originally and is everywhere else in this file. I don't know why and was wondering the same, but kept it because there presumably was a reason
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Add::add
delays decision about overflow check mode to the crate that does the code generation, even if overflow checks are disabled in the current crate, which they usually are for the standard library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to #81721, we should be using #[rustc_inherit_overflow_checks]
with a regular +
operator here instead of the Add:add
trick.
Thanks for the review and no problem at all!
It doesn't actually add new code, it rather duplicates already existing code to insert Would it be more acceptable if I refactored this into a macro that generates both copies of the code and inserts the
Indeed, that's a potential problem. I wonder if this can be avoided in some clever way? Associated types don't play well with specialization I assume, so we can't have it map to However this is the same already for the
I'll try reproducing with with clang++ and iterators, doesn't seem too easy though as C++ doesn't have such a rich API with lots of pre-defined iterators. |
Indeed. Not yet, at least. :( The main reason I'm a bit reluctant to merge this, is that this is a compiler/llvm/optimization problem, so a workaround in the library like this should only be a solution if the root problem isn't going to be fixed any time soon instead. If this was just adding an |
Makes sense! I'd expect this to be a general problem in how llvm optimizes such code, but it seems like something that is much easier to hit in Rust code than in C++ (no implicit bounds checks by default, iterator chaining not common, ...). If I implement something like the I'll try to come up with a testcase that can be reported to llvm |
ping form triage |
@JohnCSimon No news yet, it's still on my list of things to do but will probably take a while. Doing a C++ testcase for clang++ will require writing quite a bit of C++ boilerplate :) |
Ok, I actually tried implementing basically the same as what the iterators would do in plain C without any abstraction noise like iterators on top and it behaves exactly the same. Both with clang and gcc. See https://godbolt.org/z/TrzW5h . The #include <assert.h>
#include <stdint.h>
#include <stdlib.h>
uint32_t foo1(const uint32_t *x, size_t x_len, const uint32_t *y, size_t y_len) {
uint32_t sum = 0;
size_t chunk_size = y_len;
const uint32_t *y_end = y + y_len;
size_t c = 0;
for (const uint32_t *y_iter = y; y_iter != y_end; y_iter++, c++) {
size_t chunks_len = x_len - (x_len % chunk_size);
for (const uint32_t *chunks_iter = x; chunks_len >= chunk_size; chunks_iter += chunks_len, chunks_len -= chunk_size) {
uint32_t val;
assert(c < chunk_size);
val = chunks_iter[c];
sum += val + *y_iter;
}
}
return sum;
}
uint32_t foo2(const uint32_t *x, size_t x_len, const uint32_t *y, size_t y_len) {
uint32_t sum = 0;
size_t chunk_size = y_len;
for (size_t c = 0; c < chunk_size; c++) {
uint32_t y_iter;
size_t chunks_len;
assert(c < y_len);
y_iter = y[c];
chunks_len = x_len - (x_len % chunk_size);
for (const uint32_t *chunks_iter = x; chunks_len >= chunk_size; chunks_iter += chunks_len, chunks_len -= chunk_size) {
uint32_t val;
assert(c < chunk_size);
val = chunks_iter[c];
sum += val + y_iter;
}
}
return sum;
} @m-ou-se Do you think this is worth reporting to LLVM in this form? Nobody would write such code in C, but that's basically what the Rust iterators do :) |
Well of course that can be simplified to just the following, which is more realistic C code. And as bonus, the same thing with C++ iterators on In theory all the assertions in this code should be optimized away, but none are. I'll report that to LLVM then. #include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <vector>
void foo1(const uint32_t *y, size_t y_len) {
const uint32_t *y_end = y + y_len;
size_t c = 0;
for (const uint32_t *y_iter = y; y_iter != y_end; y_iter++, c++) {
assert(c < y_len);
}
assert(c == y_len);
}
void foo2(const std::vector<uint32_t>& y) {
size_t c = 0;
for (auto y_iter: y) {
assert(c < y.size());
c++;
}
assert(c == y.size());
}
void foo3(const std::vector<uint32_t>& y) {
size_t c = 0;
for (auto y_iter = y.cbegin(); y_iter != y.cend(); y_iter++, c++) {
assert(c < y.size());
}
assert(c == y.size());
} And the minimal version in Rust too pub fn foo(y: &[u32]) {
let mut x = 0;
for (c, _y) in y.iter().enumerate() {
assert!(c < y.len());
x = c;
}
assert!(x == y.len());
} |
Reported at https://bugs.llvm.org/show_bug.cgi?id=48965 . Should we close this one here then, or ...? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sdroege Nice! Thanks a lot for investigating and filing that report!
Should we close this one here then, or ...?
We could either wait for a future version of LLVM with that issue fixed to end up in rustc, or we could go ahead and merge this PR for now and remove this specialization again when it's no longer necessary. Merging this now means we don't have to wait, but it does make this part of the code a bit harder to maintain. I'd be fine with that, as long as we make sure to document (in a comment) what the specialization is for and when it can be removed (with a link to that llvm issue), to make sure this doesn't stick around forever. What do you prefer?
Well, I'd prefer if it was merged as I expect this to take quite a while to fix in LLVM. If it helps anything, I'd be happy to help with reviewing any PRs in the future in this and related code to keep the maintenance burden lower on your side. I'll update the PR accordingly now and leave the final decision to you. I'll also add another commit on top of the two open discussions above, they already apply to the old code and would be useful to document in the code. |
This allows to hint at the compiler via `intrinsics::assume()` that the returned index is smaller than the length of the underlying iterator and allows the compiler to optimize code better based on that. Fixes rust-lang#75935
… Enumerate implementation
…ems fits into an usize
14c07d8
to
49b8f87
Compare
Updated accordingly (and rebased). No functional changes other than constraining the specialization to |
// This could be bound to `TrustedLen + ExactSizeIterator` or `TrustedRandomAccess` to guarantee | ||
// that the number of elements fits into an `usize` and that the returned length is actually the | ||
// real length. `TrustedRandomAccess` was selected because specialization on `ExactSizeIterator` is | ||
// not possible (yet?). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matthewjasper Could you comment on the possibilities here? Specializing on ExactSizeIterator
would be useful here, but is not accepted. I see #[rustc_specialization_trait]
and #[rustc_unsafe_specialization_marker]
being used in some places to work around this, but I don't know in which cases those can be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering that TrustedLen
is unsafe + unstable + rustc_unsafe_specialization_marker
it is more or less implied that any type that implements it is some iterator owned by the standard library. Which means there are no lifetime shenanigans that would make ExactSizeIterator
unsafe for min_specialization.
Using that justification a rustc_unsafe_specialization_marker
helper trait inheriting from both should be fine. Probably.
On the other hand TrustedRandomAccess
is extending its tendrils into ever more places, so soon it should be almost equivalent in power to TrustedLen + ExactSizeIterator
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TrustedRandomAccess
seems like it would always be more specific than TrustedLen+ExactSizeIterator
. Implementing the former on things like BTreeMap
would probably be not a good idea, while TrustedLen+ExactSizeIterator
would be easy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then yeah, I think that should be possible. But it might be good to get a second opinion from someone who has better understanding of the min_specialization rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering the progress that happened in the LLVM bug report and all the issues found recently involving TrustedRandomAccess
, maybe let's just wait this out until LLVM handles it correctly?
What do you think @m-ou-se ?
☔ The latest upstream changes (presumably #81732) made this pull request unmergeable. Please resolve the merge conflicts. |
Ping from triage: @sdroege Hi, according to the discussion above, i'd like to switch this pr to S-Blocked state, but that needs an actual issue for this pr to block on. Is it possible to create such an issue that describes exactly what issue needs to be solved before this pr become possible to land? |
@crlf0710 https://bugs.llvm.org/show_bug.cgi?id=48965 would need to be solved and Rust would have to be updated to an LLVM version including it. |
@sdroege Sure. (Thanks for working on this!) |
This allows to hint at the compiler via
intrinsics::assume()
that thereturned index is smaller than the length of the underlying iterator and
allows the compiler to optimize code better based on that.
Fixes #75935
A benchmark comes in a bit.