Tracking issue for enabling multiple CGUs in release mode by default #45320
Comments
Presumably this will remain a stable compiler flag, but what else will change by default? Is the plan to make this only affect |
I would specifically propose that any opt level greater than 1 uses 16 codegen units and ThinLTO enabled by default for those 16 codegen units. |
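As a rough illustration of the proposal above, the default could be expressed as a function of the optimization level. This is only a sketch: `CodegenDefaults` and `proposed_defaults` are hypothetical names, not rustc internals, and the proposal does not say what lower optimization levels should get, so the sketch returns `None` for them.

```rust
/// Hypothetical settings record; not a real rustc type.
#[derive(Debug, PartialEq, Eq)]
struct CodegenDefaults {
    codegen_units: usize,
    thin_lto: bool,
}

/// Sketch of the proposed rule: any opt level greater than 1 gets
/// 16 codegen units with ThinLTO across them. Lower levels are not
/// covered by the proposal, so nothing is returned for them.
fn proposed_defaults(opt_level: u8) -> Option<CodegenDefaults> {
    if opt_level > 1 {
        Some(CodegenDefaults {
            codegen_units: 16,
            thin_lto: true,
        })
    } else {
        None
    }
}

fn main() {
    for level in 0..=3u8 {
        println!("opt-level {} -> {:?}", level, proposed_defaults(level));
    }
}
```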
This is very interesting. One would think that ThinLTO-driven inlining should always have to do less work than our much more conservative pre-LLVM inlining. But that doesn't always seem to be the case, as in the rust-doom crate. Most tests on the irlo thread seem to profit though. Overall it's still not a clear picture to me. |
@michaelwoerister was that comment meant for a different thread? I forget which one as well though, so I'll respond here! As a "quick benchmark" I compiled the regex test suite with 16 CGUs + ThinLTO and then toggled inlining in all CGUs on/off. Surprisingly inlining in all CGUs was 5s faster to compile, and additionally did better on tons of benchmarks. In those timings |
@alexcrichton Yeah, the comment was in response to #45188 (rust-doom) but since that was closed already, I put it here. Regarding the compilation time difference between pre- and post-trans inlining, my hypothesis would be that sometimes pre-trans inlining will lead to more code being eliminated early on (as in the case of regex apparently) and sometimes it will have the opposite effect. I suspect that there's room for improvement by tuning which LLVM passes we run specifically for ThinLTO. Or do we do that already? |
Heh it's true yeah, I'd imagine that there's always room for improvement in pass tuning in Rust :). Right now we perform (afaik) 0 customization of any pass manager in LLVM. All of the normal optimization passes, LTO optimization passes, and ThinLTO optimization passes are all the same as what's in LLVM itself. |
I also wanted to take the time to tabulate all the results from the call to action thread to make sure they're all visible in one place. Note that all the timings below compare a release build to a release build with 16 CGUs and ThinLTO enabled.
Improvements
All of the following compile times improved, sorted from most improved to least improved
Regressions
The following crates regressed in compile times, sorted from smallest regression to largest
alexcrichton's attempt to reproduce the regressions
Here I attempt to reproduce the regressions on my own machine with
Unfortunately the only regression I was able to reproduce was the rust-belt regression. I'll be looking more into that. |
Looking into
Almost all of the benefit of multiple codegen units is spreading out the work across all CPUs. Enabling ThinLTO to avoid losing any perf is fundamentally doing more work than what's already happening at
What's happening here is that we have one huge CGU, but when we split it up we still have one huge CGU. This means that the CGU which may take ~1% less time in LLVM only gives us a tiny sliver of a window to run ThinLTO passes. In the case of the
So generalizing even further, I think that enabling ThinLTO and multiple CGUs by default is going to regress compile time performance in any crate where our partitioning implementation doesn't actually partition very well. In the case of
I would be willing to conclude, however, that such a situation is likely quite rare. Almost all other crates in the ecosystem will benefit from the partitioning, which should basically evenly split up a crate. @michaelwoerister maybe you have thoughts on this though? |
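To make the "one huge CGU" argument concrete, here is a small sketch of a lower bound on parallel codegen time. The cost numbers are made up for illustration and `wall_clock_lower_bound` is a hypothetical helper, not anything in rustc. The point is only that no number of CPUs can finish sooner than the single largest unit.

```rust
/// Lower bound on wall-clock time for processing CGUs in parallel:
/// at least the largest single unit, and at least an even share of
/// the total work per worker. Costs are in arbitrary units.
fn wall_clock_lower_bound(cgu_costs: &[u64], workers: u64) -> u64 {
    assert!(workers > 0);
    let total: u64 = cgu_costs.iter().sum();
    let largest = cgu_costs.iter().copied().max().unwrap_or(0);
    let even_share = (total + workers - 1) / workers; // ceiling division
    largest.max(even_share)
}

fn main() {
    // Same total work (1600) split two ways across 16 CGUs.
    let even = vec![100u64; 16];
    let mut lopsided = vec![10u64; 15];
    lopsided.push(1450);

    println!("even split, 8 workers:     {}", wall_clock_lower_bound(&even, 8));
    println!("lopsided split, 8 workers: {}", wall_clock_lower_bound(&lopsided, 8));
}
```

With the lopsided split the bound barely improves on the serial time, so any extra ThinLTO work lands on top of a build that is almost as slow as before.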
Thank you so much for collecting and analyzing such a large amount of data, @alexcrichton! Your conclusions make sense to me. Unevenly distributed CGU sizes are also a problem for incremental compilation; e.g. the tokio-webpush-simple@030-minor-change test sees >90% CGU re-use, yet compile time is almost the same as from-scratch. In the non-incremental case it might be an option to detect the problematic case right after partitioning and then switch to non-LTO mode? I'm not sure it's worth the extra complexity though. For cases where there is one big CGU but that CGU contains multiple functions, we should be able to rather easily redistribute functions to other CGUs. We'd only need a metric for the size of a In conclusion, judging from the table above, I think we should make
|
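The redistribution idea above could look something like the following greedy sketch, assuming some per-function size estimate is available. `FnItem`, `rebalance`, and `bucket_sizes` are hypothetical names and the sizes are invented; this is not how rustc's partitioner is implemented.

```rust
/// Hypothetical record: one function plus an estimated size
/// (whatever metric is chosen; the numbers below are invented).
#[derive(Debug)]
struct FnItem {
    name: &'static str,
    size: u64,
}

/// Greedy rebalancing: place the largest functions first, each into
/// whichever CGU currently has the smallest total size.
fn rebalance(mut items: Vec<FnItem>, cgus: usize) -> Vec<Vec<FnItem>> {
    assert!(cgus > 0);
    items.sort_by(|a, b| b.size.cmp(&a.size));
    let mut buckets: Vec<Vec<FnItem>> = (0..cgus).map(|_| Vec::new()).collect();
    let mut totals = vec![0u64; cgus];
    for item in items {
        let idx = (0..cgus).min_by_key(|&i| totals[i]).unwrap();
        totals[idx] += item.size;
        buckets[idx].push(item);
    }
    buckets
}

/// Total estimated size of each CGU.
fn bucket_sizes(buckets: &[Vec<FnItem>]) -> Vec<u64> {
    buckets
        .iter()
        .map(|b| b.iter().map(|f| f.size).sum())
        .collect()
}

fn main() {
    let items = vec![
        FnItem { name: "a", size: 50 },
        FnItem { name: "b", size: 40 },
        FnItem { name: "c", size: 30 },
        FnItem { name: "d", size: 20 },
        FnItem { name: "e", size: 10 },
        FnItem { name: "f", size: 10 },
    ];
    let buckets = rebalance(items, 2);
    for (i, bucket) in buckets.iter().enumerate() {
        let names: Vec<&str> = bucket.iter().map(|f| f.name).collect();
        println!("cgu {}: {:?}", i, names);
    }
    println!("sizes: {:?}", bucket_sizes(&buckets));
}
```

Note that this only helps when the big CGU contains several functions. A single enormous function still ends up alone in one CGU and caps the achievable speedup.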
FWIW, as the author of |
This seems like a good thing to have in our back pocket, but I'm also wary of trying to do this by default. For example the
Agreed! So far in all the crates I've seen, O(instructions) has been quite a good metric for "how long this takes in all LLVM-related passes", so counting MIR sounds reasonable to me as well. Again though I also feel like this is ok to have in our back pocket. I'd want to dig more into the
Quite surprisingly I've yet to see any data that it's beneficial to compile time in release mode or doesn't hurt runtime. (all data is that it hurts both compile time and runtime performance!) That being said we're seeing such big wins from ThinLTO on big projects today that we can probably just save this as a possible optimization for the future. On the topic of inline functions I recently dug up #10212 again (later closed in favor of #14527, but the latter may no longer be true today). I suspect that may "fix" quite a bit of the compile time issue here without coming at a loss of performance? In any case possibly lots of interesting things we could do there.
Also a good question! I find this to be a difficult one, however, because the CGU number will affect the output artifact, which means that if we do this sort of probing it'll be beneficial for performance but come at the cost of more difficult deterministic builds. You'd have to specify the CGUs manually or build on the same-ish hardware to get a deterministic build I think? I think though that if you have 1 CPU then this is universally a huge regression. With one CPU we'd split the preexisting one huge CGU into N different ones, probably take roughly the same amount of time to optimize those, and then tack on ThinLTO and more optimization passes. My guess is that with one CPU we'd easily see 50% regressions. With 2+ CPUs however I'd probably expect to see benefits to compile time. Anything giving us twice the resources to churn through the CGUs more quickly should quickly start seeing wins in theory I think. For now though, deterministic builds win me over in terms of leaving this as-is for all hardware configurations. That and I doubt anyone's compiling Rust on single-core machines nowadays! |
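The single-CPU guess above can be put into a toy model. Everything here is an assumption made for illustration: `relative_build_time` is a hypothetical helper, the crate is assumed to split evenly, all work (including the extra ThinLTO passes) is assumed to parallelize, and the 0.5 overhead simply mirrors the "50% regression" guess rather than any measurement.

```rust
/// Toy model, not a measurement. Returns build time relative to a
/// single-CGU release build (1.0). `thinlto_overhead` is the extra
/// work added by ThinLTO as a fraction of the baseline work.
fn relative_build_time(cpus: u32, cgus: u32, thinlto_overhead: f64) -> f64 {
    assert!(cpus > 0 && cgus > 0);
    // Parallelism is capped by both the CPU count and the CGU count.
    let parallelism = cpus.min(cgus) as f64;
    (1.0 + thinlto_overhead) / parallelism
}

fn main() {
    for &cpus in &[1u32, 2, 4, 8] {
        println!(
            "{} CPU(s), 16 CGUs: {:.2}x of baseline",
            cpus,
            relative_build_time(cpus, 16, 0.5)
        );
    }
}
```

Under these assumptions one CPU pays the full overhead, while two CPUs already come out ahead of the baseline, which matches the intuition in the comment.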
One thing that I think is also worth pointing out is that up to this point we've mostly been measuring the runtime of an entire
The benefit of ThinLTO and multiple CGUs is leveraging otherwise idle parallelism on the build machine. It's overall increasing the amount of work the compiler does. For a
Put another way, ThinLTO and multiple CGUs should only be beneficial for builds which don't have many crates compiling in parallel for long parts of the build. If a build is 100% parallel for the entire time then ThinLTO will likely regress compile time performance. Now you might realize, however, that one very common case where you're only building one crate is in an incremental build! Typically if you do an incremental build you're only building a handful of crates, often serially. In that sense I think that there are some massive wins for ThinLTO + multiple CGUs in incremental builds rather than entire crate builds. Although improving both is of course great as well :) |
For deterministic builds you have to do some extra configuration anyway (e.g. path remapping) so I would not consider that a blocker. And I guess there are single core VMs around somewhere. But all of this is such a niche case that I don't really care
MIR-only RLIBs should put us into a pretty good spot regarding this, so I'm quite confident that we're on the right path overall. |
That's really interesting. For incremental compilation enabled the situation might be different (because pre-trans inlining hurts re-use) but for the non-incremental case it sounds like it's pretty clear what to do. |
Yeah, maybe we should revisit this at some point. Although I have to say the current solution of only |
Hm yeah that's a good point about needing configuration anyway for deterministic builds. It now seems like a more plausible route to take! Also that's a very interesting point about incremental and inlining on our own end... Maybe we should dig more into those ThinLTO runtime regressions at some point! Also yeah I don't really want to change how we trans inline functions just yet, I do like the simplicity too :) |
@alexcrichton out of curiosity, how did you figure this out, and which function was it? I've wanted to look into why the webrender build is so slow and would welcome tips. |
@jrmuizel oh sure I'd love to explain! So I originally found
That drops The graph here isn't always the easiest to read, but we've clearly got two huge bars, both of which correspond to taking a huge amount of time for that one CGU (first is optimization, second is ThinLTO + codegen). Next I ran:
and that command dumps a bunch of IR files into Inside that file it was 70k lines and some poking around showed that one function was 66k lines of IR. I sort of forget now how at this point I went from that IR to determining there was a huge function in there though... |
rustbuild: Compile rustc with ThinLTO
This commit enables ThinLTO for the compiler as well as multiple codegen units. This is intended to get the benefits of parallel codegen while also avoiding any major loss of perf. Finally this commit is also intended as further testing for #45320 and shaking out bugs.
This commit moves the standard library to get compiled with multiple codegen units and ThinLTO like the compiler itself. This I would hope is the last major step towards closing out rust-lang#45320
rustc: Set release mode cgus to 16 by default
This commit is the next attempt to enable multiple codegen units by default in release mode, getting some of those sweet, sweet parallelism wins by running codegen in parallel. Performance should not be lost due to ThinLTO being on by default as well. Closes rust-lang#45320
I don't know if this is the right place, but I think we should consider whether the symbol issues reported in the Nightly section here: #46552 should be a blocker. See also the discussion in the commit comment here ;) The tl;dr is that certain versions of LLVM (I've seen this on 3.8, and whatever LLVM rustc nightly uses) append seemingly random garbage to the end of some names, e.g. we get:
instead of:
This knocks the debuginfo out of sync (it doesn't have the garbage appended). I've been able to repro this with clang 3.8 for C++ files as well (haven't tested on other LLVM versions), and switching to 5.0 seems to have fixed the issue. I don't know if it's within reach, but perhaps we should attempt upgrading to LLVM 5.0 before releasing this on stable? Note I understand this is for release mode, but I see this in debug mode on rustc nightly right now as well... |
I'm opening this up to serve as a tracking issue for enabling multiple codegen units in release mode by default. I've written up a lengthy summary before, but the tl;dr is that multiple codegen units enable us to run optimization/code generation in parallel, making use of all available computing resources and often speeding up compilations by more than 2x.
Historically this has not been done due to claims of a loss in performance, but the recently implemented ThinLTO is intended to assuage such concerns. The most viable route forward seems to be to enable multiple CGUs and ThinLTO at the same time in release mode.
Performance summary
Blocking issues:
ThinLTO exposes too many symbols - fixed
first attempt - blocked on presumed LLVM bug - Rust tracking issue - fixed
blocked on test failures - current presumed cause of test failures - next attempt
Possible build-time regressions using multiple CGUs in debug mode - couldn't reproduce
Reported build time regression in rust-doom - couldn't reproduce
ThinLTO broken some MSVC rlibs - fixed
Potential blockers/bugs:
proposed fix - update to rust