Avoid query cache sharding code in single-threaded mode #94084
Conversation
@bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit e25a77dca5e5cbfb78d927a9541661428d87331c with merge 471ea6ab86e550c13a729833d90e362bbc7d9622...
☀️ Try build successful - checks-actions
Queued 471ea6ab86e550c13a729833d90e362bbc7d9622 with parent 30b3f35, future comparison URL.
We absolutely should make changes that improve the performance of the non-parallel compiler. However, it'd be nice if we had some means of measuring the impact on the parallel compiler, to evaluate the need for separate code paths. Is there any tracking issue for having rustc-perf measure parallel compiler performance?
The expectation is not necessarily to land this code as-is (I think that's unlikely), but to identify how much of a win this is -- that will help calibrate the investment into the various next steps, e.g., (a) keeping parallel compilation equivalent with additional cfg work, or (b) not bothering with this patch at all. If we do see a large enough improvement, benchmarking parallel compilers locally is possible -- just time consuming, since you need to build from scratch both on master and with your changes (typically a good 30-60 minutes minimum each) and then run at least a subset of perf through them. That could help with the evaluation.

Tracking parallel compiler performance is not currently done, and I'm not aware of an issue for it. This is primarily because no one is really actively working on that mode, so spending time investing in infra to track it does not seem particularly worthwhile -- it would require essentially doubling our costs (number of metrics, servers, etc.), which seems pretty extreme for a feature with essentially zero active development.
Ah, got it. In that case, any objections to marking this PR as a draft? That often serves as a good indicator of "this is being used to check performance of an idea". I absolutely agree that we shouldn't run that tracking in general on every perf run until we have more active development on it. But it'd help to have the ability to enable it, and to be able to run it specifically for PRs we'd expect to affect it. (As well as, perhaps, a perf run per release.)
The lack of an assigned reviewer (i.e., an explicit r? @ghost) is my signal for whether work is not intended for review -- I don't typically use the draft view on GitHub, though I don't really care either way. Tracking it even irregularly still requires quite a bit of work to get all the pieces in the right order today, but it's not necessarily blocked on infra work (try builds with the right CI changes are sufficient), so someone well-motivated could start doing so.
FWIW, one reason I am reluctant to track it is that we already cannot reliably keep up with triaging the numerous, typically relatively small, perf regressions we see today. I suspect that the parallel compiler mode will be even more difficult to diagnose regressions in -- at least with the current suite of tools -- so I am reluctant to add that extra data to our perf-tracking work.
Ah, sorry, missed the r? @ghost convention. Good to know that it's something someone could put together if motivated to do so.
I absolutely wouldn't expect the perf team to handle diagnosing or dealing with such regressions; the only time we'd want to consider them is when making changes like this that may trade off optimization of one for the other. |
Finished benchmarking commit (471ea6ab86e550c13a729833d90e362bbc7d9622): comparison url. Summary: This benchmark run shows 15 relevant improvements 🎉 but 23 relevant regressions 😿 to instruction counts.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.
Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never
Looks like bootstrap data is not actually getting properly sorted -- rust-lang/rustc-perf#1175 should fix that -- but overall this shaves a good 2% off bootstrap times (15 seconds), with a largely neutral overall effect on perf: the regressions here do not seem big and are likely optimizer noise, given the patch, and there are some improvements of roughly equal magnitude.

The win seems significant enough to be worth spending some time on bringing this from prototype to actually landing it -- I'm not sure how best to do that yet. cc @cjgillot @rust-lang/wg-incr-comp, since the changes primarily thread through incremental code.

My initial thinking is we can either just land this (pretty much as-is, modulo some further comment cleanup, renaming struct fields, etc.) or try to cfg-gate all the sharding away. Given the relatively small size of this PR, the cfg approach is probably not too much trouble, but it would definitely require some plumbing and, I suspect, look pretty unfortunate. @joshtriplett's point on parallel compiler performance is likely worth taking into account too; I can try to gather some statistics there, but it'll be a bit of a pain for sure. If we choose to 'just cfg all the relevant bits', that could perhaps mean skipping the parallel evaluation.
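For illustration, a minimal sketch of the cfg approach under discussion (this is not the PR's actual code; the shard count, names, and use of Mutex as a stand-in for rustc's Lock are assumptions modeled loosely on rustc_data_structures::sharded):

```rust
use std::sync::Mutex; // stand-in for rustc's Lock<T>

// Parallel builds keep multiple shards to reduce lock contention;
// non-parallel builds collapse statically to a single shard.
#[cfg(parallel_compiler)]
const SHARDS: usize = 32;
#[cfg(not(parallel_compiler))]
const SHARDS: usize = 1;

pub struct Sharded<T> {
    shards: [Mutex<T>; SHARDS],
}

impl<T> Sharded<T> {
    #[inline]
    pub fn get_shard_by_hash(&self, hash: u64) -> &Mutex<T> {
        // With SHARDS == 1 this indexing constant-folds to `&self.shards[0]`,
        // so the shard-selection machinery compiles away in non-parallel mode.
        &self.shards[(hash as usize) % SHARDS]
    }
}
```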
I've done some work in #93787 on separating parallel_compiler, but it needs review.
I think this would be largely orthogonal to that PR (or would increase the work needed to separate it out), since it moves an API from being largely equivalent in both parallel and non-parallel modes to being quite different between them.
The former would almost certainly regress the parallel compiler, right? Given that all the shards would be locked where currently a single shard is locked.
Presuming there's heavy contention on a given query -- yes. It's worth noting that each individual query still has its own lock, so if threads are doing parallel work and largely executing distinct queries, then contention is probably minimal. We don't really hold the locks themselves for that long, either.

#61779 added the sharding based on what looks like ~1 documented data point, though the 30% win there is certainly significant. On the other hand, IIRC 1st-gen Ryzen was particularly bad at latency when shuffling cache lines between cores, so I'm not sure how much of a win sharding ends up being today. I think it's pretty likely that we could fairly minimally adjust the PR to keep the sharding under cfg(parallel_compiler).
Doesn't non-parallel rustc already use a single shard?
(rust/compiler/rustc_data_structures/src/sharded.rs, lines 18 to 19 at 3b18651)
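The permalink above is authoritative; as a hedged reconstruction from memory of sharded.rs around this era, the referenced definition is approximately:

```rust
// SHARD_BITS selects the shard count; it is zero (a single shard)
// when the parallel_compiler cfg is off.
#[cfg(parallel_compiler)]
const SHARD_BITS: usize = 5;
#[cfg(not(parallel_compiler))]
const SHARD_BITS: usize = 0;

pub const SHARDS: usize = 1 << SHARD_BITS;
```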
Yes, it does. As the perf results illustrate, though, the layers of abstraction there add a fairly considerable chunk of compilation time, while runtime performance is largely unaffected. I'm working on a revision of this patch that aims to cfg the sharding more carefully, alongside some cleanups, so we keep equivalent results on parallel builds.
Force-pushed from 6aa3210 to 594ea74 (Compare)
@bors try @rust-timer queue

Alright, pushed up a new set of commits which do a more thorough refactoring, split across multiple commits, and keep largely identical high-level behavior for parallel compilation (modulo a few mostly minor details around QueryLookup keeping shard indices and such rather than recomputing them from scratch; those are hard to pipe around under cfg and do not seem likely to be meaningful to me).
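For context, the QueryLookup mentioned here threaded values computed during the initial cache probe down to the insertion point. A hypothetical reconstruction of its shape (the field names are assumptions, not the actual definition):

```rust
// Hypothetical shape of the removed QueryLookup: it carried values computed
// once during the initial cache probe so they did not need recomputing on
// insert.
pub struct QueryLookup {
    pub key_hash: u64, // hash of the query key, computed once
    pub shard: usize,  // shard index derived from key_hash
}
```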
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit 594ea74 with merge 18dcd0a8e0dab57c40141aa2d34a1f9e33c365b3...
☀️ Try build successful - checks-actions
Queued 18dcd0a8e0dab57c40141aa2d34a1f9e33c365b3 with parent c1aa854, future comparison URL.
Finished benchmarking commit (18dcd0a8e0dab57c40141aa2d34a1f9e33c365b3): comparison url. Summary: This benchmark run shows 26 relevant improvements 🎉 but 21 relevant regressions 😿 to instruction counts.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.
Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never
Results look pretty mixed but, I think, overall neutral -- stress tests dominate the regressions, and there are also some small improvements. Looking at cachegrind diffs locally, the regressions don't look obviously related to the work in this PR, so I am marking them as triaged (more likely inlining noise and similar).
I'm wondering: is there a longer-term context for these changes? This PR optimizes the serial compiler and degrades the parallel compiler. Should we consider dropping/reimplementing the parallel compiler?
The latest version of this PR should have a roughly neutral effect on parallel compiler performance, since it keeps sharding things in that mode. IMO, it may not be a bad idea to drop the parallel compiler support unless we have concrete investment expected in the next 6-18 month timeframe, since it does cause constant 'small' pain across many bits of the compiler. But this PR would ideally not be blocked on a decision there :)
The parallel compiler currently suffers from non-deterministic ICEs (in addition to the other known issues around jobserver pipe contention, lack of horizontal scalability, etc.), but when/if it works, it seems to be surprisingly effective on compile times.
LGTM. |
It was already unused -- if you look at the commit deleting QueryLookup, we're not actually threading it down anywhere. The query shard was used, but recomputing it from the hash we calculate anyway should be pretty cheap (it's just a shift and mask), and threading it through only on parallel compilers seems like more work than we ought to do.

It's possible that actually caching the key hash would make sense, but I think we also see a little benefit from saving the registers/stack space needed to thread that data down, so it's not guaranteed to help. I expect the query key hash is typically really fast to compute, as the majority of our keys are e.g. DefId or so, which take just a handful of instructions to compute FxHash for.

@bors r=cjgillot
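For reference, the "shift and mask" computation is along these lines -- a runnable sketch modeled on rustc_data_structures::sharded::get_shard_index_by_hash, where the exact bit positions and shard count are assumptions:

```rust
const SHARD_BITS: usize = 5;
const SHARDS: usize = 1 << SHARD_BITS;

// Derive a shard index from a key hash using only a shift and a mask.
// The high bits (below the top 7) are used because hashbrown consumes
// the top 7 bits and the low bits of the hash internally.
#[inline]
fn get_shard_index_by_hash(hash: u64) -> usize {
    let bits = (hash >> (64 - 7 - SHARD_BITS)) as usize;
    bits % SHARDS // SHARDS is a power of two, so this compiles to a mask
}

fn main() {
    // Recomputing the index from an already-known hash costs just these
    // two operations, which is why caching it buys little.
    let idx = get_shard_index_by_hash(0xdead_beef_dead_beef);
    assert!(idx < SHARDS);
    println!("shard = {idx}");
}
```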
📌 Commit 594ea74 has been approved by cjgillot
☀️ Test successful - checks-actions |
Finished benchmarking commit (3b1fe7e): comparison url. Summary: This benchmark run shows 55 relevant improvements 🎉 to instruction counts.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. @rustbot label: -perf-regression |
In non-parallel compilers, the query cache sharding code just adds needless overhead at compilation time (since there is statically only one shard anyway). Removing it amounts to a roughly ~10 second reduction in bootstrap time, with overall neutral (some wins, some losses) performance results.
Parallel compiler performance should be largely unaffected by this PR; sharding is kept there.