Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suboptimal debug codegen in rand #43299

Closed
alexcrichton opened this issue Jul 17, 2017 · 8 comments
Closed

Suboptimal debug codegen in rand #43299

alexcrichton opened this issue Jul 17, 2017 · 8 comments
Labels
A-codegen Area: Code generation C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@alexcrichton
Copy link
Member

I found in one of my crates that when compiling in debug mode (cargo build) the largest function in the whole executable looked like it was this one. That looked pretty innocuous so I dug a little deeper. It turns out that function creates 1000 alloc instructions using this program as find.rs compiling this source:

$ rustc +nightly -V
rustc 1.20.0-nightly (b2c070787 2017-07-13)
$ rustc +nightly foo.rs --emit llvm-ir --crate-type lib -Z time-passes && ./find < ./foo.ll
1137 allocas

Trying to winnow it down further this source contains 337 allocas:

$ rustc +nightly foo.rs --emit llvm-ir --crate-type lib -Z time-passes && ./find < ./foo.ll
337 allocas

For comparison, the equivalent C++ program (I think? my C++ isn't very good) has only 86 allocas:

$ clang++ -emit-llvm -S foo.cpp -std=c++11 && ./find < ./foo.ll
86 allocas

Interestingly enough the C++ program compiles in ~0.035s whereas the Rust program compiles in ~0.2s, over 4x slower. The major passes in Rust 0.07s in translation, 0.03s in expansion, and 0.01s in item-bodies checking and LLVM passes. As to where the other 0.08s went in -Z time-passes I'm not sure!

@alexcrichton alexcrichton added the A-codegen Area: Code generation label Jul 17, 2017
@alexcrichton
Copy link
Member Author

Also it looks like if the inner block is repeated in place Clang doesn't compile any more allocas (it's always 86) whereas Rust's number of alloca instances will grow linearly.

@aturon aturon added the I-compiletime Issue: Problems and improvements with respect to compile times. label Jul 17, 2017
@kennytm
Copy link
Member

kennytm commented Jul 17, 2017

Changing a = a - e to a -= e (and all 24 operations) increases that to 433 allocas 😕

@oyvindln
Copy link
Contributor

oyvindln commented Jul 18, 2017

From a glance at the assmbly, I suspect Range::next() might be one of the issues here. (Though there are probably several things.) The function uses mem::swap to save the incremented counter value back to the iterator and return the previous value of the counter.

Now, mem::swap is simply a shim that calls ptr::swap_nonoverlapping which ultimately calls This massive function (ptr::swap_nonoverlapping_bytes). For larger types this is used to swap efficiently using SIMD, though for swapping two primitive numbers it's a bit overkill. Now this will get optimised out release mode, but in debug mode it results in a lot of redundant cruft. I presume swap is used since the range trait could theoretically be used on something that is not primitive numbers, though maybe the next function (and anything else currently using swap) should be specialised to avoid using swap for types where it's not needed. Even in release mode this seems like something that will create a lot of redundant work for the llvm optimiser.

EDIT: This doesn't seem to be the main source of the alloca statements inside the loops though, but it's probably still worth looking into.

@oyvindln
Copy link
Contributor

oyvindln commented Jul 20, 2017

So there are some differences between the cpp version you posted and the rust version.

Firstly there is the loop. In the C++ version, the loop is a loop over an array of 4 elements, rather than a range iterator as in Rust. See this. It looks as though it avoids using iterators entirely in this case (I don't see any calls to begin() end() in the output) which makes it a bit cleaner than the rust version which has to iterate by calling next().

Secondly, there are some differences in the functions on the wrapping type. In Rust the methods on Wrapping are marked #[inline(always)], so they get inlined even when not doing optimizations. In the rust version this results redundant extra temporary copies being created (e.g more alloc statements) for every invocation of these methods. (This also explains the massive increase when manually unrolling) In the cpp version, the temporaries are created inside the method call instead, which results in the IR having less alloc statements.

I made an altered cpp version that gives results that are closer to the rust version, by force-inlining the method calls and calling wrapping add/sub through a function as in rust: (I didn't bother altering the loop iteration as that would be a bit more complex.)

https://gist.github.com/oyvindln/cc65fc6c479d3347708be972798d49f7

EDIT: Fixed this (The other value in Wrapping::shr/shl is anded by this in the rust version also, but I'm not sure what value pub const u64: u32 = i64; results in). It looks like the value is masked twice with the same value in the rust code.

This gives me 286 allocas, which is a fair bit closer to the rust version. Interestingly, the wrapping_{add/sub} functions have two allocas in c++, and only one in rust. Maybe there is some basic processing that could be done during this inlining to avoid all the redundant copies here.

Another interesting observation is that both the rust and c++ versions treat the wrapping type as though it is an aggregate, and thus accesses the inner value by a pointer into the struct. Maybe #[repr(transparent)] could help here to let llvm know the struct could be treated as an u64 here and avoid having to use getelementptr to access the value.

@alexcrichton
Copy link
Member Author

Oh nice find on the #[inline(always)]!

I removed the annotations and the example dropped from 337 allocas to 129 allocas (yay!). Unfortunately though Rust still shows linear growth in terms of repetition while C++ still doesn't :(

@alexcrichton
Copy link
Member Author

I've opened #43367 to remove inline(always) annotations.

@oyvindln
Copy link
Contributor

oyvindln commented Jul 21, 2017

Removing the redundant masking operation in Wrapping:shl/shr (which calls wrapping_shl/shr which already masks anyhow) would probably also help avoid one alloca for each of those functions, at least it did when tested with a local copy of them (didn't try with the standard library itself).

@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 28, 2017
@jonas-schievink jonas-schievink added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jan 31, 2020
@Mark-Simulacrum
Copy link
Member

We appear to generate a constant 46 allocas today in debug on stable for this program, including when repeating the body of the loop many times. I think this is fixed. The original program (random generator) produces a total of 103 allocas without filtering as well, so I think we're doing much better in this space.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

6 participants