rustc_codegen_ssa: tune codegen according to available concurrency #82127
Conversation
r? @varkor (rust-highfive has picked a reviewer for you, use r? to override)
This change mostly targets concurrent rustc scenarios (i.e. cargo building multiple crates at once), or any scenario where […].

But here are some graphs of manual benchmark results.

[graph] This shows the memory usage using the current […].

[graph] This shows how some additional heuristics perform. The line labels indicate when the heuristic considers the queue to be full enough for the main thread to decide to stop codegenning and start LLVMing (see the code comments for more detail than you probably want). Notice that the […].

[graph] This shows the memory usage of the current heuristic versus the proposed one when compiling rustc_middle with […].

[graph] This shows the same benchmark but with LTO off. The reduction in peak memory usage is ~600 MB, or 22%.

One thing I learned from this is how little an LLVM module seems to take up in memory after it's been serialized into an in-memory buffer […].

Stats were gathered by polling system memory usage once per second via the […].

@rustbot label T-compiler A-codegen I-compilemem
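The heuristics compared above all answer the same question: when is the codegen queue full enough that the main thread should stop codegenning and start LLVMing? Here is a minimal sketch of that shape of decision; the function name, signature, and threshold are illustrative assumptions for this thread, not the PR's actual code (which lives in rustc_codegen_ssa).

```rust
/// Illustrative sketch only: should the main thread stop codegenning
/// more modules and start running LLVM itself? The answer scales with
/// the workers actually running, not with the CPU count.
fn queue_full_enough(items_in_queue: usize, workers_running: usize) -> bool {
    // Assumed threshold for illustration: keep roughly a quarter of the
    // running workers' worth of modules queued -- enough that no worker
    // starves, little enough that we don't pin piles of codegened
    // modules in memory.
    let quarter = workers_running.div_ceil(4);
    items_in_queue > 0 && items_in_queue >= quarter
}

fn main() {
    // With 2 running workers, one queued module already counts as full.
    assert!(queue_full_enough(1, 2));
    // With 16 running workers we want at least 4 modules queued.
    assert!(!queue_full_enough(3, 16));
    assert!(queue_full_enough(4, 16));
}
```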
@bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit ed858772f2047fdb02db3328fef70e7cec4023a1 with merge b12cc6754498881db4edf807a09dd6d3c6cf4f15...
☀️ Try build successful - checks-actions
Queued b12cc6754498881db4edf807a09dd6d3c6cf4f15 with parent 9503ea1, future comparison URL.
Finished benchmarking try commit (b12cc6754498881db4edf807a09dd6d3c6cf4f15): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying `rollup-` to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
This change tunes ahead-of-time codegenning according to the amount of concurrency available, rather than according to the number of CPUs on the system. This can lower memory usage by reducing the number of compiled LLVM modules in memory at once, particularly across several rustc instances.

Previously, each rustc instance would assume that it should codegen ahead of time to meet the demand of number-of-CPUs workers. But often, a rustc instance doesn't have nearly that much concurrency available to it, because the available concurrency is split, via the jobserver, across all active rustc instances spawned by the driving cargo process, and is further limited by the `-j` flag argument. Each rustc might therefore have held several times as many LLVM modules in memory as it really needed to meet demand. If the modules were large, the effect on memory usage was noticeable.

With this change, the required amount of ahead-of-time codegen scales up with the actual number of workers running within a rustc instance. Note that the number of workers running can be less than the actual concurrency available to a rustc instance. However, if more concurrency becomes available, workers are spun up quickly as job tokens are acquired, and the ahead-of-time codegen scales up quickly as well.
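To make the before/after concrete, here is an illustrative sketch of the sizing difference described above. All names (`target_queue_depth` and its parameters) are invented for this illustration; the real logic lives in rustc_codegen_ssa's codegen coordinator and is considerably more involved.

```rust
/// Illustrative sketch only -- invented API, not the coordinator's
/// real code. How many codegened-but-not-yet-LLVMed modules should a
/// rustc instance aim to keep queued?
fn target_queue_depth(workers_running: usize, num_cpus: usize, new_heuristic: bool) -> usize {
    if new_heuristic {
        // New: meet the demand of the workers this rustc instance
        // actually has running. If the jobserver grants more tokens,
        // more workers spin up and this target rises with them.
        workers_running
    } else {
        // Old: assume one worker per CPU, even though the available
        // concurrency is split across every rustc spawned by cargo
        // and further capped by `-j`.
        num_cpus
    }
}

fn main() {
    // A 16-CPU machine where the jobserver has granted this rustc only
    // 4 tokens: the old sizing queues 4x the modules it can consume.
    assert_eq!(target_queue_depth(4, 16, false), 16);
    assert_eq!(target_queue_depth(4, 16, true), 4);
}
```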
Force-pushed from ed85877 to 5f243d3.
Added more to the massive comment. Wanted to note that keeping the queue full is not only beneficial for meeting the demand of existing workers, but for meeting demand when we have a surge in available concurrency. Apologies if the comment is too monstrous. It's hard for me to express the scheduling nuances succinctly. Or anything, really. :) As expected, the rustc-perf results don't show any change in max-rss. There may be a small regression in compile times (0.6% on bootstrap).

@varkor, were you already reviewing this, or would you prefer to have it off your plate? I know @michaelwoerister has worked in this area (I always seem to end up in your neck of the woods, @michaelwoerister!).

@tgnottingham: sorry, I've been quite busy this week and last. You will probably get a quicker review if you have another reviewer in mind already, but I'll try to get to it this weekend if not :)

@varkor, no problem at all. :)

Thanks for the change, and the detailed analysis and comment! This all looks good to me! @bors r+
📌 Commit 5f243d3 has been approved by `varkor`
☀️ Test successful - checks-actions |