Avoid query cache sharding code in single-threaded mode #94084
Conversation
@bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit e25a77dca5e5cbfb78d927a9541661428d87331c with merge 471ea6ab86e550c13a729833d90e362bbc7d9622...
☀️ Try build successful - checks-actions
Queued 471ea6ab86e550c13a729833d90e362bbc7d9622 with parent 30b3f35, future comparison URL.
We absolutely should make changes that improve the performance of the non-parallel compiler. However, it'd be nice if we had some means of measuring the impact on the parallel compiler, to evaluate the need for separate code paths. Is there any tracking issue for having rustc-perf measure parallel compiler performance?
The expectation is not necessarily to land this code as-is (I think that's unlikely), but to identify how much of a win this is -- that will help calibrate the investment into the various next steps, e.g., (a) keeping parallel compilation equivalent with additional cfg work, or (b) not bothering with this patch at all. If we do see a large enough improvement, benchmarking parallel compilers locally is possible -- just time consuming, since you need to build from scratch both on master and with your changes (typically a good 30-60 minutes minimum each) and then run at least a subset of perf through them. That could help with the evaluation.

Tracking parallel compiler performance is not currently done, and I'm not aware of an issue for it. This is primarily because no one is really actively working on that mode, so spending time investing in infra to track it does not seem particularly worthwhile -- it would require essentially doubling our costs (number of metrics, servers, etc.), which seems pretty extreme for a feature with essentially zero active development.
Ah, got it. In that case, any objections to marking this PR as a draft? That often serves as a good indicator of "this is being used to check performance of an idea". I absolutely agree that we shouldn't run that tracking in general on every perf run until we have more active development on it. But it'd help to have the ability to enable it, and to be able to run it specifically for PRs we'd expect to affect it. (As well as, perhaps, a perf run per release.)
The lack of an assigned reviewer (i.e., an explicit r? @ghost) is my signal for whether work is not intended for review -- I don't typically use the draft view on GitHub, though I don't really care either way. Tracking it even irregularly still requires quite a bit of work to get all the pieces in the right order today, but it's not necessarily blocked on infra work (try builds with the right CI changes are sufficient), so someone well-motivated could start doing so.
FWIW, one reason I am reluctant to track it is that we already cannot reliably keep up with triaging the numerous, typically relatively small, perf regressions we see today. I suspect that the parallel compiler mode will be even more difficult to diagnose regressions in -- at least with the current suite of tools -- so I am reluctant to add that extra data to our perf-tracking work.
Ah, sorry, missed the r? @ghost convention. Good to know that it's something someone could put together if motivated to do so.
I absolutely wouldn't expect the perf team to handle diagnosing or dealing with such regressions; the only time we'd want to consider them is when making changes like this that may trade off optimization of one for the other. |
Finished benchmarking commit (471ea6ab86e550c13a729833d90e362bbc7d9622): comparison url. Summary: This benchmark run shows 15 relevant improvements 🎉 but 23 relevant regressions 😿 to instruction counts.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.
Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never
Looks like bootstrap data is not actually getting properly sorted -- rust-lang/rustc-perf#1175 should fix that -- but overall this shaves a good 2% off bootstrap times (15 seconds), with a largely neutral overall effect on perf: the regressions here do not seem big and are likely optimizer noise, given the patch, and there are some improvements of roughly equal magnitude.

The win seems significant enough to be worth spending some time on bringing this from prototype to actually landing it -- I'm not sure how best to do that yet. cc @cjgillot @rust-lang/wg-incr-comp, since the changes primarily thread through incremental code.

My initial thinking is we can either just land this (pretty much as-is, modulo some further comment cleanup, renaming struct fields, etc.) or try to cfg-gate all the sharding away. Given the relatively small size of this PR, the cfg approach is probably not too much trouble, but it would definitely require some plumbing and, I suspect, look pretty unfortunate. @joshtriplett's point on parallel compiler performance is likely worth taking into account too; I can try to gather some statistics there, but it'll be a bit of a pain for sure. If we choose to 'just cfg all the relevant bits', that could perhaps mean skipping the parallel evaluation.
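For illustration, a minimal sketch of the cfg approach under discussion (this is not the PR's actual code; the shard count, names, and use of Mutex as a stand-in for rustc's Lock are assumptions modeled loosely on rustc_data_structures::sharded):

```rust
use std::sync::Mutex; // stand-in for rustc's Lock<T>

// Parallel builds keep multiple shards to reduce lock contention;
// non-parallel builds collapse statically to a single shard.
#[cfg(parallel_compiler)]
const SHARDS: usize = 32;
#[cfg(not(parallel_compiler))]
const SHARDS: usize = 1;

pub struct Sharded<T> {
    shards: [Mutex<T>; SHARDS],
}

impl<T> Sharded<T> {
    #[inline]
    pub fn get_shard_by_hash(&self, hash: u64) -> &Mutex<T> {
        // With SHARDS == 1 this indexing constant-folds to `&self.shards[0]`,
        // so the shard-selection machinery compiles away in non-parallel mode.
        &self.shards[(hash as usize) % SHARDS]
    }
}
```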
I've done some work in #93787 on separating parallel_compiler, but it needs review.
I think this would be largely orthogonal to that PR (or would increase the work needed to separate it out), since it moves an API from being largely equivalent in both parallel and non-parallel modes to being quite different between them.
The former would almost certainly regress the parallel compiler, right? Given that all the shards would be locked where currently a single shard is locked.
Presuming there's heavy contention on a given query -- yes. It's worth noting that each individual query still has its own lock, so if threads are doing parallel work and largely executing distinct queries, then contention is probably minimal. We don't really hold the locks themselves for that long, either.

#61779 added the sharding based on what looks like ~1 documented data point, though the 30% win there is certainly significant. On the other hand, IIRC 1st-gen Ryzen was particularly bad at latency when shuffling cache lines between cores, so I'm not sure how much of a win sharding ends up being today. I think it's pretty likely that we could fairly minimally adjust the PR to keep the sharding under cfg(parallel_compiler).
Doesn't non-parallel rustc already use a single shard?
(rust/compiler/rustc_data_structures/src/sharded.rs, lines 18 to 19 at 3b18651)
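The permalink above is authoritative; as a hedged reconstruction from memory of sharded.rs around this era, the referenced definition is approximately:

```rust
// SHARD_BITS selects the shard count; it is zero (a single shard)
// when the parallel_compiler cfg is off.
#[cfg(parallel_compiler)]
const SHARD_BITS: usize = 5;
#[cfg(not(parallel_compiler))]
const SHARD_BITS: usize = 0;

pub const SHARDS: usize = 1 << SHARD_BITS;
```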
Yes, it does. As the perf results illustrate, though, the layers of abstraction there add a fairly considerable chunk of compilation time, while runtime performance is largely unaffected. I'm working on a revision of this patch that aims to cfg the sharding more carefully, alongside some cleanups, so we keep equivalent results on parallel builds.
Force-pushed from 6aa3210 to 594ea74 (Compare)
@bors try @rust-timer queue

Alright, pushed up a new set of commits which do a more thorough refactoring, split across multiple commits, and keep largely identical high-level behavior for parallel compilation (modulo a few mostly minor details around QueryLookup keeping shard indices and such rather than recomputing them from scratch; those are hard to pipe around under cfg and do not seem likely to be meaningful to me).
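For context, the QueryLookup mentioned here threaded values computed during the initial cache probe down to the insertion point. A hypothetical reconstruction of its shape (the field names are assumptions, not the actual definition):

```rust
// Hypothetical shape of the removed QueryLookup: it carried values computed
// once during the initial cache probe so they did not need recomputing on
// insert.
pub struct QueryLookup {
    pub key_hash: u64, // hash of the query key, computed once
    pub shard: usize,  // shard index derived from key_hash
}
```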
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit 594ea74 with merge 18dcd0a8e0dab57c40141aa2d34a1f9e33c365b3...
☀️ Try build successful - checks-actions
Queued 18dcd0a8e0dab57c40141aa2d34a1f9e33c365b3 with parent c1aa854, future comparison URL.
Finished benchmarking commit (18dcd0a8e0dab57c40141aa2d34a1f9e33c365b3): comparison url. Summary: This benchmark run shows 26 relevant improvements 🎉 but 21 relevant regressions 😿 to instruction counts.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.
Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never
Results look pretty mixed but, I think, overall neutral -- stress tests dominate the regressions, and there are also some small improvements. Looking at cachegrind diffs locally, the regressions don't look obviously related to the work in this PR, so I am marking them as triaged (more likely inlining noise and similar).
I'm wondering: is there a longer-term context for these changes? This PR optimizes the serial compiler and degrades the parallel compiler. Should we consider dropping/reimplementing the parallel compiler?
The latest version of this PR should have a roughly neutral effect on parallel compiler performance, since it keeps sharding things in that mode. IMO, it may not be a bad idea to drop the parallel compiler support unless we have concrete investment expected in the next 6-18 month timeframe, since it does cause constant 'small' pain across many bits of the compiler. But this PR would ideally not be blocked on a decision there :)
The parallel compiler currently suffers from non-deterministic ICEs (in addition to the other known issues around jobserver pipe contention, lack of horizontal scalability, etc.), but when/if it works, it seems to be surprisingly effective on compile times.
LGTM. |
It was already unused -- if you look at the commit deleting QueryLookup, we're not actually threading it down anywhere. The query shard was used, but recomputing it from the hash we calculate anyway should be pretty cheap (it's just a shift and mask), and threading it through only on parallel compilers seems like more work than we ought to do.

It's possible that actually caching the key hash would make sense, but I think we also see a little benefit from saving the registers/stack space needed to thread that data down, so it's not guaranteed to help. I expect the query key hash is typically really fast to compute, as the majority of our keys are e.g. DefId or so, which take just a handful of instructions to compute FxHash for.

@bors r=cjgillot
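For reference, the "shift and mask" computation is along these lines -- a runnable sketch modeled on rustc_data_structures::sharded::get_shard_index_by_hash, where the exact bit positions and shard count are assumptions:

```rust
const SHARD_BITS: usize = 5;
const SHARDS: usize = 1 << SHARD_BITS;

// Derive a shard index from a key hash using only a shift and a mask.
// The high bits (below the top 7) are used because hashbrown consumes
// the top 7 bits and the low bits of the hash internally.
#[inline]
fn get_shard_index_by_hash(hash: u64) -> usize {
    let bits = (hash >> (64 - 7 - SHARD_BITS)) as usize;
    bits % SHARDS // SHARDS is a power of two, so this compiles to a mask
}

fn main() {
    // Recomputing the index from an already-known hash costs just these
    // two operations, which is why caching it buys little.
    let idx = get_shard_index_by_hash(0xdead_beef_dead_beef);
    assert!(idx < SHARDS);
    println!("shard = {idx}");
}
```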
📌 Commit 594ea74 has been approved by cjgillot
☀️ Test successful - checks-actions |
Finished benchmarking commit (3b1fe7e): comparison url. Summary: This benchmark run shows 55 relevant improvements 🎉 to instruction counts.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. @rustbot label: -perf-regression |
In non-parallel compilers, the query cache sharding code just adds needless overhead at compilation time (since there is statically only one shard anyway). Removing it amounts to a roughly ~10 second reduction in bootstrap time, with overall neutral (some wins, some losses) performance results.
Parallel compiler performance should be largely unaffected by this PR; sharding is kept there.