rustc_codegen_ssa: tune codegen according to available concurrency #82127
Conversation
r? @varkor (rust-highfive has picked a reviewer for you, use r? to override)
This change mostly targets concurrent rustc scenarios (i.e. cargo building multiple crates at once), or any scenario where […].

But here are some graphs of manual benchmark results.

[graph] This shows the memory usage using the current […].

[graph] This shows how some additional heuristics perform. The line labels indicate when the heuristic considers the queue to be full enough for the main thread to decide to stop codegenning and start LLVMing (see the code comments for more detail than you probably want). Notice that the […].

[graph] This shows the memory usage of the current heuristic versus the proposed one when compiling rustc_middle with […].

[graph] This shows the same benchmark but with LTO off. The reduction in peak memory usage is ~600 MB, or 22%.

One thing I learned from this is how little an LLVM module seems to take up in memory after it's been serialized into an in-memory buffer […].

Stats were gathered by polling system memory usage once per second via the […].

@rustbot label T-compiler A-codegen I-compilemem
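The heuristics compared above all answer the same question: when is the codegen queue full enough that the main thread should stop codegenning and start LLVMing? Here is a minimal sketch of that shape of decision; the function name, signature, and threshold are illustrative assumptions for this thread, not the PR's actual code (which lives in rustc_codegen_ssa).

```rust
/// Illustrative sketch only: should the main thread stop codegenning
/// more modules and start running LLVM itself? The answer scales with
/// the workers actually running, not with the CPU count.
fn queue_full_enough(items_in_queue: usize, workers_running: usize) -> bool {
    // Assumed threshold for illustration: keep roughly a quarter of the
    // running workers' worth of modules queued -- enough that no worker
    // starves, little enough that we don't pin piles of codegened
    // modules in memory.
    let quarter = workers_running.div_ceil(4);
    items_in_queue > 0 && items_in_queue >= quarter
}

fn main() {
    // With 2 running workers, one queued module already counts as full.
    assert!(queue_full_enough(1, 2));
    // With 16 running workers we want at least 4 modules queued.
    assert!(!queue_full_enough(3, 16));
    assert!(queue_full_enough(4, 16));
}
```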
@bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit ed858772f2047fdb02db3328fef70e7cec4023a1 with merge b12cc6754498881db4edf807a09dd6d3c6cf4f15...
☀️ Try build successful - checks-actions
Queued b12cc6754498881db4edf807a09dd6d3c6cf4f15 with parent 9503ea1, future comparison URL.
Finished benchmarking try commit (b12cc6754498881db4edf807a09dd6d3c6cf4f15): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying `rollup-` to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
This change tunes ahead-of-time codegenning according to the amount of concurrency available, rather than according to the number of CPUs on the system. This can lower memory usage by reducing the number of compiled LLVM modules in memory at once, particularly across several rustc instances.

Previously, each rustc instance would assume that it should codegen ahead of time to meet the demand of number-of-CPUs workers. But often, a rustc instance doesn't have nearly that much concurrency available to it, because the available concurrency is split, via the jobserver, across all active rustc instances spawned by the driving cargo process, and is further limited by the `-j` flag argument. Each rustc might therefore have held several times as many LLVM modules in memory as it really needed to meet demand. If the modules were large, the effect on memory usage was noticeable.

With this change, the required amount of ahead-of-time codegen scales up with the actual number of workers running within a rustc instance. Note that the number of workers running can be less than the actual concurrency available to a rustc instance. However, if more concurrency becomes available, workers are spun up quickly as job tokens are acquired, and the ahead-of-time codegen scales up quickly as well.
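To make the before/after concrete, here is an illustrative sketch of the sizing difference described above. All names (`target_queue_depth` and its parameters) are invented for this illustration; the real logic lives in rustc_codegen_ssa's codegen coordinator and is considerably more involved.

```rust
/// Illustrative sketch only -- invented API, not the coordinator's
/// real code. How many codegened-but-not-yet-LLVMed modules should a
/// rustc instance aim to keep queued?
fn target_queue_depth(workers_running: usize, num_cpus: usize, new_heuristic: bool) -> usize {
    if new_heuristic {
        // New: meet the demand of the workers this rustc instance
        // actually has running. If the jobserver grants more tokens,
        // more workers spin up and this target rises with them.
        workers_running
    } else {
        // Old: assume one worker per CPU, even though the available
        // concurrency is split across every rustc spawned by cargo
        // and further capped by `-j`.
        num_cpus
    }
}

fn main() {
    // A 16-CPU machine where the jobserver has granted this rustc only
    // 4 tokens: the old sizing queues 4x the modules it can consume.
    assert_eq!(target_queue_depth(4, 16, false), 16);
    assert_eq!(target_queue_depth(4, 16, true), 4);
}
```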
Force-pushed from ed85877 to 5f243d3.
Added more to the massive comment. Wanted to note that keeping the queue full is not only beneficial for meeting the demand of existing workers, but for meeting demand when we have a surge in available concurrency. Apologies if the comment is too monstrous. It's hard for me to express the scheduling nuances succinctly. Or anything, really. :) As expected, the rustc-perf results don't show any change in max-rss. There may be a small regression in compile times (0.6% on bootstrap).

@varkor, were you already reviewing this, or would you prefer to have it off your plate? I know @michaelwoerister has worked in this area (I always seem to end up in your neck of the woods, @michaelwoerister!).

@tgnottingham: sorry, I've been quite busy this week and last. You will probably get a quicker review if you have another reviewer in mind already, but I'll try to get to it this weekend if not :)

@varkor, no problem at all. :)

Thanks for the change, and the detailed analysis and comment! This all looks good to me! @bors r+
📌 Commit 5f243d3 has been approved by `varkor`
☀️ Test successful - checks-actions |