-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce a minimum CGU size in non-incremental builds. #112448
Conversation
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
⌛ Trying commit d7903614aaba46d87a0288e6eea386edb08d4c30 with merge 01c64cbb288ae7a44bb43912291d2b7b6e47da7d... |
I got some |
This comment has been minimized.
This comment has been minimized.
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (01c64cbb288ae7a44bb43912291d2b7b6e47da7d): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 649.387s -> 650.164s (0.12%) |
Hmm. This should be a slam dunk, based on logic and local measurements. The results here are great (or "no change") across the board for almost every metric (even the obscure ones) except walltime, where it's very mixed, with a few more regressions than improvements. Which is puzzling, because this really should be a clear win, especially for cases like Also, lots of changes for incremental cases, which shouldn't be happening because this doesn't change incremental compilation at all. Including lots of changes to binary size for incremental cases, which makes no sense. (In comparison, the "object file" measurements look as expected. What's the difference between binary size (a.k.a. "linked artifact") and "object file"?) |
This comment has been minimized.
This comment has been minimized.
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
⌛ Trying commit 45b08a53a870b118a159e65b710f0a8f3dfb1126 with merge 6532381723f53907d56ef7c7f21d2c3a77f7340a... |
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (6532381723f53907d56ef7c7f21d2c3a77f7340a): comparison URL. Overall result: ❌✅ regressions and improvements - ACTION NEEDEDBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 647.29s -> 647.896s (0.09%) |
@nnethercote Could the fact that the CGUs now get sorted by size even if there are less than the limit make a difference? |
I had a similar thought, but this runs shortly afterwards, sorting the CGUs by name. Also, I did some debug logging and the CGU contents were identical before and after. Even worse, I'm getting big discrepancies between instruction counts measurements from Cachegrind vs perf. It's all very puzzling, I'll have to investigate more next week. |
45b08a5
to
6b456f8
Compare
⌛ Trying commit 9a516e047a6016474be5811bb66323cb5370e7e5 with merge 244b1111491e0e0b819b9a9aedb433964d59c0dd... |
☀️ Try build successful - checks-actions |
1 similar comment
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (244b1111491e0e0b819b9a9aedb433964d59c0dd): comparison URL. Overall result: ✅ improvements - no action neededBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 649.134s -> 650.1s (0.15%) |
9a516e0
to
6f00c8d
Compare
split into. (The exact number will depend on the size and structure of the | ||
crate's source code.) It takes an integer greater than 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could see the parenthetical statement being interpreted as "we may create more CGUs than the value you specified". Maybe this formulation is more clear on that point?
split into. (The exact number will depend on the size and structure of the | |
crate's source code.) It takes an integer greater than 0. | |
split into. (Fewer code generation units may be created depending on the | |
size and structure of the crate's source code.) It takes an integer | |
greater than 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove the parenthetical. The important part of the change is specifying that it's a maximum.
Another interesting result from your final commit, adding up all of the CGU sizes shows:
My guess is that this is due to fewer duplicated inline functions in different CGUs. This might be another statistic worth reporting on. |
For example, we go from this: ``` FINAL (4059 items, total_size=232342; 16 CGUs, max_size=39608, min_size=5468, max_size/min_size=7.2): - CGU[0] regex.f2ff11e98f8b05c7-cgu.0 (318 items, size=39608): - fn ... - fn ... ``` to this: ``` FINAL - unique items: 2726 (1459 root + 1267 inlined), unique size: 201214 (146046 root + 55168 inlined) - placed items: 4059 (1459 root + 2600 inlined), placed size: 232342 (146046 root + 86296 inlined) - placed/unique items ratio: 1.49, placed/unique size ratio: 1.15 - CGUs: 16, mean size: 14521.4, sizes: [39608, 31122, 20318, 20236, 16268, 13777, 12310, 10531, 10205, 9810, 9250, 9065 (x2), 7785, 7524, 5468] - CGU[0] - regex.f2ff11e98f8b05c7-cgu.0, size: 39608 - items: 318, mean size: 124.6, sizes: [28395, 3418, 558, 485, 259, 228, 176, 166, 146, 118, 117 (x3), 114 (x5), 113 (x3), 101, 84, 82, 77, 76, 72, 71 (x2), 66, 65, 62, 61, 59 (x2), 57, 55, 54 (x2), 53 (x4), 52 (x5), 51 (x4), 50, 48, 47, 46, 45 (x3), 44, 43 (x5), 42, 40, 38 (x4), 37, 35, 34 (x2), 32 (x2), 31, 30, 28 (x2), 27 (x2), 26 (x3), 24 (x2), 23 (x3), 22 (x2), 21, 20, 16 (x4), 15, 13 (x7), 12 (x3), 11 (x6), 10, 9 (x2), 8 (x4), 7 (x8), 6 (x38), 5 (x21), 4 (x7), 3 (x45), 2 (x63), 1 (x13)] - fn ... - fn ... ``` This is a lot more information, distinguishing between root items and inlined items, showing how much duplication there is of inlined items, plus the full range of sizes for CGUs and items within CGUs. All of which is really helpful when analyzing this stuff and trying different CGU formation algorithms.
6f00c8d
to
50acba5
Compare
Because tiny CGUs make compilation less efficient *and* result in worse generated code. We don't do this when the number of CGUs is explicitly given, because there are times when the requested number is very important, as described in some comments within the commit. So the commit also introduces a `CodegenUnits` type that distinguishes between default values and user-specified values. This change has a roughly neutral effect on walltimes across the rustc-perf benchmarks; there are some speedups and some slowdowns. But it has significant wins for most other metrics on numerous benchmarks, including instruction counts, cycles, binary size, and max-rss. It also reduces parallelism, which is good for reducing jobserver competition when multiple rustc processes are running at the same time. It's smaller benchmarks that benefit the most; larger benchmarks already have CGUs that are all larger than the minimum size. Here are some example before/after CGU sizes for opt builds. - html5ever - CGUs: 16, mean size: 1196.1, sizes: [3908, 2992, 1706, 1652, 1572, 1136, 1045, 948, 946, 938, 579, 471, 443, 327, 286, 189] - CGUs: 4, mean size: 4396.0, sizes: [6706, 3908, 3490, 3480] - libc - CGUs: 12, mean size: 35.3, sizes: [163, 93, 58, 53, 37, 8, 2 (x6)] - CGUs: 1, mean size: 424.0, sizes: [424] - tt-muncher - CGUs: 5, mean size: 1819.4, sizes: [8508, 350, 198, 34, 7] - CGUs: 1, mean size: 9075.0, sizes: [9075] Note that CGUs of size 100,000+ aren't unusual in larger programs.
50acba5
to
7c3ce02
Compare
Yes, that's definitely a big part of the improvement here. It's hard to report on because merging currently happens before inlining, so you can't make this measurement without two versions of the compiler, one doing the tiny CGU merging and one not. |
I have addressed the review comments. I also made some very small tweaks to the debug printing, but nothing needing additional review. Thanks! @bors r=wesleywiser |
☀️ Test successful - checks-actions |
Finished benchmarking commit (fa8762b): comparison URL. Overall result: ✅ improvements - no action needed@rustbot label: -perf-regression Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 648.99s -> 648.277s (-0.11%) |
Because tiny CGUs slow down compilation and result in worse generated code.
r? @wesleywiser