Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a specialized version std::iter::Enumerate for TrustedLen #77822

Closed
wants to merge 3 commits into from

Conversation

sdroege
Copy link
Contributor

@sdroege sdroege commented Oct 11, 2020

This allows to hint at the compiler via intrinsics::assume() that the
returned index is smaller than the length of the underlying iterator and
allows the compiler to optimize code better based on that.

Fixes #75935


A benchmark comes in a bit.

@rust-highfive
Copy link
Collaborator

r? @shepmaster

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 11, 2020
@sdroege
Copy link
Contributor Author

sdroege commented Oct 11, 2020

Benchmark for the function from the issue with the same implementation as this PR outside of std.

It's around 17.6% faster on this function.

test tests::bench_test_0 ... bench:   3,517,733 ns/iter (+/- 585,186)
test tests::bench_test_1 ... bench:   2,991,140 ns/iter (+/- 239,283)
#![feature(test)]
#![feature(specialization)]
#![feature(trusted_len)]
#![feature(core_intrinsics)]

#![allow(soft_unstable)]

extern crate test;

pub fn test_1(x: &[u32], y: &[u32]) -> u32 {
    let mut sum = 0;
    let chunk_size = y.len();
    for (c, y) in Enumerate::new(y.iter()) {
        for chunk in x.chunks_exact(chunk_size) {
            sum += chunk[c] + y;
        }
    }
    sum
}

pub fn test_0(x: &[u32], y: &[u32]) -> u32 {
    let mut sum = 0;
    let chunk_size = y.len();
    for (c, y) in y.iter().enumerate() {
        for chunk in x.chunks_exact(chunk_size) {
            sum += chunk[c] + y;
        }
    }
    sum
}

struct Enumerate<I> {
    iter: I,
    count: usize,
    len: usize,
}

impl<I: Iterator> Enumerate<I> {
    #[inline]
    fn new(iter: I) -> Self {
        EnumerateImpl::new(iter)
    }
}

impl<I> Iterator for Enumerate<I>
where
    I: Iterator,
{
    type Item = (usize, <I as Iterator>::Item);

    #[inline]
    fn next(&mut self) -> Option<(usize, <I as Iterator>::Item)> {
        EnumerateImpl::next(self)
    }
}

// Enumerate specialization trait
#[doc(hidden)]
trait EnumerateImpl<I> {
    type Item;
    fn new(iter: I) -> Self;
    fn next(&mut self) -> Option<(usize, Self::Item)>;
}

impl<I> EnumerateImpl<I> for Enumerate<I>
where
    I: Iterator,
{
    type Item = I::Item;

    #[inline]
    default fn new(iter: I) -> Self {
        Enumerate {
            iter,
            count: 0,
            len: 0, // unused
        }
    }

    #[inline]
    default fn next(&mut self) -> Option<(usize, I::Item)> {
        let a = self.iter.next()?;
        let i = self.count;
        // Possible undefined overflow.
        self.count += 1;
        Some((i, a))
    }
}

impl<I> EnumerateImpl<I> for Enumerate<I>
where
    I: std::iter::TrustedLen + ExactSizeIterator + Iterator,
{
    #[inline]
    fn new(iter: I) -> Self {
        let len = iter.size_hint().0;

        Enumerate {
            iter,
            count: 0,
            len,
        }
    }

    #[inline]
    fn next(&mut self) -> Option<(usize, I::Item)> {
        let a = self.iter.next()?;
        unsafe { std::intrinsics::assume(self.count < self.len); }
        let i = self.count;
        // Possible undefined overflow.
        self.count += 1;
        Some((i, a))
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    #[bench]
    fn bench_test_0(b: &mut Bencher) {
        let x = vec![0; 4 * 1024 * 1024];
        let y = vec![0; 4 * 1024 * 1024];
        b.iter(|| {
            test::black_box(test_0(test::black_box(&x), test::black_box(&y)));
        });
    }

    #[bench]
    fn bench_test_1(b: &mut Bencher) {
        let x = vec![0; 4 * 1024 * 1024];
        let y = vec![0; 4 * 1024 * 1024];
        b.iter(|| {
            test::black_box(test_1(test::black_box(&x), test::black_box(&y)));
        });
    }
}

@sdroege sdroege force-pushed the enumerate-specialization branch from 54a6861 to 49f8381 Compare October 11, 2020 16:27
@sdroege sdroege force-pushed the enumerate-specialization branch 2 times, most recently from 5f1ebd5 to cb9db68 Compare October 11, 2020 16:55
@jyn514 jyn514 added I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 11, 2020
@GuillaumeGomez
Copy link
Member

cc @rust-lang/libs

@GuillaumeGomez
Copy link
Member

@pietroalbini Do we have a benchmark tool for such changes?

@pietroalbini
Copy link
Member

We can see if this affects rustc's speed with perf, but otherwise I'm not aware of other benchmarking tooling.

@sdroege
Copy link
Contributor Author

sdroege commented Oct 29, 2020

Is there anything I can do to help moving this forward?

@shepmaster
Copy link
Member

r? @m-ou-se

@bors
Copy link
Contributor

bors commented Nov 23, 2020

☔ The latest upstream changes (presumably #79319) made this pull request unmergeable. Please resolve the merge conflicts.

Note that reviewers usually do not review pull requests until merge conflicts are resolved! Once you resolve the conflicts, you should change the labels applied by bors to indicate that your PR is ready for review. Post this as a comment to change the labels:

@rustbot modify labels: +S-waiting-on-review -S-waiting-on-author

@sdroege sdroege force-pushed the enumerate-specialization branch 2 times, most recently from 27d8cee to fb23034 Compare November 23, 2020 10:45
@sdroege
Copy link
Contributor Author

sdroege commented Nov 23, 2020

umbrella The latest upstream changes (presumably #79319) made this pull request unmergeable. Please resolve the merge conflicts.

Should be all good again.

@sdroege sdroege force-pushed the enumerate-specialization branch from fb23034 to 14c07d8 Compare November 23, 2020 11:02
@crlf0710 crlf0710 added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 11, 2020
Copy link
Member

@m-ou-se m-ou-se left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your PR, and sorry for the delay.

This change adds quite a lot of non-trivial code to fix a seemingly small optimization problem. In addition, it makes the Enumerate struct bigger with a field that's often unused, which can be noticable in some situations.

I agree #75935 is a problem that should be fixed, but I'm a bit uncomfortable with a solution/workaround like this.

Do similar problems come up in other languages that use LLVM, and how is it handled there? Does LLVM maybe already have an open issue for a similar situation in e.g. C++ (where pointer pairs are also the common representation of ranges)?

library/core/src/iter/adapters/enumerate.rs Show resolved Hide resolved
// SAFETY: the caller must uphold the contract for
// `Iterator::__iterator_get_unchecked`.
let value = unsafe { try_get_unchecked(&mut self.iter, idx) };
let idx = Add::add(self.count, idx);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this use Add::add instead of +? If this is necessary, a comment explaining why would be useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's how it was originally and is everywhere else in this file. I don't know why and was wondering the same, but kept it because there presumably was a reason

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Add::add delays decision about overflow check mode to the crate that does the code generation, even if overflow checks are disabled in the current crate, which they usually are for the standard library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to #81721, we should be using #[rustc_inherit_overflow_checks] with a regular + operator here instead of the Add:add trick.

@m-ou-se m-ou-se added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 30, 2020
@sdroege
Copy link
Contributor Author

sdroege commented Dec 31, 2020

Thanks for your PR, and sorry for the delay.

Thanks for the review and no problem at all!

This change adds quite a lot of non-trivial code to fix a seemingly small optimization problem.

It doesn't actually add new code, it rather duplicates already existing code to insert intrinsics::assume() in strategic places. Arguably worse :)

Would it be more acceptable if I refactored this into a macro that generates both copies of the code and inserts the assume() in one of the two instantiations of the macro?

In addition, it makes the Enumerate struct bigger with a field that's often unused, which can be noticable in some situations.

Indeed, that's a potential problem. I wonder if this can be avoided in some clever way? Associated types don't play well with specialization I assume, so we can't have it map to () in the general case and bool in the specialized case?

However this is the same already for the Zip implementations. The generic one keeps an unused boolean around that is only used by the TrustedRandomAccess-specialized one.

Do similar problems come up in other languages that use LLVM, and how is it handled there? Does LLVM maybe already have an open issue for a similar situation in e.g. C++ (where pointer pairs are also the common representation of ranges)?

I'll try reproducing with with clang++ and iterators, doesn't seem too easy though as C++ doesn't have such a rich API with lots of pre-defined iterators.

@m-ou-se
Copy link
Member

m-ou-se commented Dec 31, 2020

Associated types don't play well with specialization I assume

Indeed. Not yet, at least. :(


The main reason I'm a bit reluctant to merge this, is that this is a compiler/llvm/optimization problem, so a workaround in the library like this should only be a solution if the root problem isn't going to be fixed any time soon instead. If this was just adding an assume() line or two, it wouldn't be much of a problem. But specializations like this make it significantly harder to maintain this code, and make it much easier to add mistakes in the future. I do realize we already have plenty of specializations like this for performance reasons, but I'd first like to understand if this might already be fixed soon by a compiler (llvm? mir? ..?) patch, before we reach for a workaround like this.

@sdroege
Copy link
Contributor Author

sdroege commented Dec 31, 2020

Makes sense!

I'd expect this to be a general problem in how llvm optimizes such code, but it seems like something that is much easier to hit in Rust code than in C++ (no implicit bounds checks by default, iterator chaining not common, ...). If I implement something like the chunks_exact() and enumerate() plus bounds-checking array accessors in C++ I'm 99% sure it would behave the same, but probably no C++ programmer would write such code in practice.

I'll try to come up with a testcase that can be reported to llvm

@JohnCSimon
Copy link
Member

ping form triage
@sdroege - can you please post the status of this PR? thank you.

@sdroege
Copy link
Contributor Author

sdroege commented Jan 19, 2021

@sdroege - can you please post the status of this PR? thank you.

@JohnCSimon No news yet, it's still on my list of things to do but will probably take a while. Doing a C++ testcase for clang++ will require writing quite a bit of C++ boilerplate :)

@sdroege
Copy link
Contributor Author

sdroege commented Jan 30, 2021

Ok, I actually tried implementing basically the same as what the iterators would do in plain C without any abstraction noise like iterators on top and it behaves exactly the same. Both with clang and gcc.

See https://godbolt.org/z/TrzW5h .

The assert() in foo1() stays, both assert()s in foo2() are optimized away. Adding a __builtin_assume(c < y_len); at the top of the outer loop's body (which is basically what this PR here does) allows the assertion to get optimized away.

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

uint32_t foo1(const uint32_t *x, size_t x_len, const uint32_t *y, size_t y_len) {
  uint32_t sum = 0;
  size_t chunk_size = y_len;

  const uint32_t *y_end = y + y_len;
  size_t c = 0;
  for (const uint32_t *y_iter = y; y_iter != y_end; y_iter++, c++) {
    size_t chunks_len = x_len - (x_len % chunk_size);

    for (const uint32_t *chunks_iter = x; chunks_len >= chunk_size; chunks_iter += chunks_len, chunks_len -= chunk_size) {
      uint32_t val;

      assert(c < chunk_size);
      val = chunks_iter[c];

      sum += val + *y_iter;
    }
  }

  return sum;
}

uint32_t foo2(const uint32_t *x, size_t x_len, const uint32_t *y, size_t y_len) {
  uint32_t sum = 0;
  size_t chunk_size = y_len;

  for (size_t c = 0; c < chunk_size; c++) {
    uint32_t y_iter;
    size_t chunks_len;

    assert(c < y_len);
    y_iter = y[c];
    
    chunks_len  = x_len - (x_len % chunk_size);
    for (const uint32_t *chunks_iter = x; chunks_len >= chunk_size; chunks_iter += chunks_len, chunks_len -= chunk_size) {
      uint32_t val;

      assert(c < chunk_size);
      val = chunks_iter[c];

      sum += val + y_iter;
    }
  }

  return sum;
}

@m-ou-se Do you think this is worth reporting to LLVM in this form? Nobody would write such code in C, but that's basically what the Rust iterators do :)

@sdroege
Copy link
Contributor Author

sdroege commented Jan 30, 2021

Well of course that can be simplified to just the following, which is more realistic C code. And as bonus, the same thing with C++ iterators on std::vector.

In theory all the assertions in this code should be optimized away, but none are. I'll report that to LLVM then.

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#include <vector>

void foo1(const uint32_t *y, size_t y_len) {
  const uint32_t *y_end = y + y_len;
  size_t c = 0;
  for (const uint32_t *y_iter = y; y_iter != y_end; y_iter++, c++) {
    assert(c < y_len);
  }
  assert(c == y_len);
}

void foo2(const std::vector<uint32_t>& y) {
    size_t c = 0;
    for (auto y_iter: y) {
        assert(c < y.size());
        c++;
    }
    assert(c == y.size());
}

void foo3(const std::vector<uint32_t>& y) {
    size_t c = 0;
    for (auto y_iter = y.cbegin(); y_iter != y.cend(); y_iter++, c++) {
        assert(c < y.size());
    }
    assert(c == y.size());
}

And the minimal version in Rust too

pub fn foo(y: &[u32]) {
    let mut x = 0;
    for (c, _y) in y.iter().enumerate() {
        assert!(c < y.len());
        x = c;
    }
    assert!(x == y.len());
}

@sdroege
Copy link
Contributor Author

sdroege commented Jan 30, 2021

Reported at https://bugs.llvm.org/show_bug.cgi?id=48965 . Should we close this one here then, or ...?

Copy link
Member

@m-ou-se m-ou-se left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdroege Nice! Thanks a lot for investigating and filing that report!

Should we close this one here then, or ...?

We could either wait for a future version of LLVM with that issue fixed to end up in rustc, or we could go ahead and merge this PR for now and remove this specialization again when it's no longer necessary. Merging this now means we don't have to wait, but it does make this part of the code a bit harder to maintain. I'd be fine with that, as long as we make sure to document (in a comment) what the specialization is for and when it can be removed (with a link to that llvm issue), to make sure this doesn't stick around forever. What do you prefer?

library/core/src/iter/adapters/enumerate.rs Outdated Show resolved Hide resolved
@sdroege
Copy link
Contributor Author

sdroege commented Jan 30, 2021

What do you prefer?

Well, I'd prefer if it was merged as I expect this to take quite a while to fix in LLVM. If it helps anything, I'd be happy to help with reviewing any PRs in the future in this and related code to keep the maintenance burden lower on your side.

I'll update the PR accordingly now and leave the final decision to you. I'll also add another commit on top of the two open discussions above, they already apply to the old code and would be useful to document in the code.

This allows to hint at the compiler via `intrinsics::assume()` that the
returned index is smaller than the length of the underlying iterator and
allows the compiler to optimize code better based on that.

Fixes rust-lang#75935
@sdroege sdroege force-pushed the enumerate-specialization branch from 14c07d8 to 49b8f87 Compare January 30, 2021 16:40
@sdroege
Copy link
Contributor Author

sdroege commented Jan 30, 2021

Updated accordingly (and rebased). No functional changes other than constraining the specialization to TrustedRandomAccess and otherwise just various additional comments.

Comment on lines +108 to +111
// This could be bound to `TrustedLen + ExactSizeIterator` or `TrustedRandomAccess` to guarantee
// that the number of elements fits into an `usize` and that the returned length is actually the
// real length. `TrustedRandomAccess` was selected because specialization on `ExactSizeIterator` is
// not possible (yet?).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matthewjasper Could you comment on the possibilities here? Specializing on ExactSizeIterator would be useful here, but is not accepted. I see #[rustc_specialization_trait] and #[rustc_unsafe_specialization_marker] being used in some places to work around this, but I don't know in which cases those can be used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering that TrustedLen is unsafe + unstable + rustc_unsafe_specialization_marker it is more or less implied that any type that implements it is some iterator owned by the standard library. Which means there are no lifetime shenanigans that would make ExactSizeIterator unsafe for min_specialization.
Using that justification a rustc_unsafe_specialization_marker helper trait inheriting from both should be fine. Probably.

On the other hand TrustedRandomAccess is extending its tendrils into ever more places, so soon it should be almost equivalent in power to TrustedLen + ExactSizeIterator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TrustedRandomAccess seems like it would always be more specific than TrustedLen+ExactSizeIterator. Implementing the former on things like BTreeMap would probably be not a good idea, while TrustedLen+ExactSizeIterator would be easy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then yeah, I think that should be possible. But it might be good to get a second opinion from someone who has better understanding of the min_specialization rules.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering the progress that happened in the LLVM bug report and all the issues found recently involving TrustedRandomAccess, maybe let's just wait this out until LLVM handles it correctly?

What do you think @m-ou-se ?

@bors
Copy link
Contributor

bors commented Feb 22, 2021

☔ The latest upstream changes (presumably #81732) made this pull request unmergeable. Please resolve the merge conflicts.

@crlf0710
Copy link
Member

Ping from triage: @sdroege Hi, according to the discussion above, i'd like to switch this pr to S-Blocked state, but that needs an actual issue for this pr to block on. Is it possible to create such an issue that describes exactly what issue needs to be solved before this pr become possible to land?

@sdroege
Copy link
Contributor Author

sdroege commented Mar 12, 2021

@crlf0710 https://bugs.llvm.org/show_bug.cgi?id=48965 would need to be solved and Rust would have to be updated to an LLVM version including it.

@m-ou-se
Copy link
Member

m-ou-se commented Mar 27, 2021

maybe let's just wait this out until LLVM handles it correctly?

@sdroege Sure. (Thanks for working on this!)

@m-ou-se m-ou-se closed this Mar 27, 2021
@sdroege sdroege deleted the enumerate-specialization branch May 6, 2021 07:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs Relevant to the library team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

slice::iter() does not preserve number of iterations information for optimizer causing unneeded bounds checks