Filter out short-lived LLVM diagnostics before they reach the rustc handler #113339
Conversation
@bors try @rust-timer queue
⌛ Trying commit 1a8ebe489d2077d188ad02bc2af7455136b39829 with merge 86a42973b773876956b20455a549ebcb24dd9e6e...
@@ -1995,7 +1995,7 @@ extern "C" void LLVMRustContextConfigureDiagnosticHandler(
        std::move(RemarkFile),
        std::move(RemarkStreamer),
        std::move(LlvmRemarkStreamer)
-   ));
+   ), true);
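The extra `true` argument presumably corresponds to the `RespectFilters` behavior discussed in this thread: the handler checks a remark's pass name against the configured filters before forwarding it. A minimal Rust sketch of that idea (all names hypothetical, not the actual LLVM or rustc API):

```rust
/// Hypothetical model of a diagnostic handler with remark filtering.
/// With `respect_filters` enabled, a remark's pass name is checked
/// against the configured filters before the remark is forwarded,
/// instead of every remark reaching the downstream callback.
struct DiagnosticHandler {
    respect_filters: bool,
    enabled_passes: Vec<String>, // e.g. passes enabled via -Cremark
}

impl DiagnosticHandler {
    fn is_remark_enabled(&self, pass: &str) -> bool {
        if !self.respect_filters {
            // Old behavior: everything is forwarded, and must be
            // filtered (after allocation) on the rust side.
            return true;
        }
        self.enabled_passes.iter().any(|p| p == pass)
    }
}
```

With `respect_filters: false`, every remark reaches the callback; with it enabled, only remarks from passes in `enabled_passes` do.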
The filtering enabled here takes place after the remarks have already been constructed:
It would be even better to identify which remarks those are and emit them conditionally, like all the others.
Yeah, I stopped here after seeing that our diagnostics handler was no longer being called with optimization diagnostics, to check whether changes to the PGO diagnostics would be impactful.
This is a great point, thank you; I will look into that.
Here's some of what I found.
On debug builds for helloworld, diesel, exa, cargo, and syn together (the latter two generate a lot of optimization diagnostics; we're talking 6-7 GB in total on cargo alone, for example, which RespectFilters = true eliminates), there were 330k diagnostics in total.
They come from 2 passes: "sdagisel", the more common (260k, 80%), and "asm-printer" next (72k, 20%).
Here's the top 60% to give you an idea:
336578 counts
( 1) 166274 (49.4%, 49.4%): optimization diagnostic received, kind: OptimizationMissed, pass: "sdagisel", message: "FastISel missed terminator"
( 2) 23695 ( 7.0%, 56.4%): optimization diagnostic received, kind: OptimizationMissed, pass: "sdagisel", message: "FastISel didn't lower all arguments: ptr"
( 3) 7822 ( 2.3%, 58.8%): optimization diagnostic received, kind: OptimizationMissed, pass: "sdagisel", message: "FastISel missed"
( 4) 3962 ( 1.2%, 59.9%): optimization diagnostic received, kind: OptimizationAnalysis, pass: "asm-printer", message: "8 instructions in function"
...
The rest are all around the <=1% range, with unique FastISel messages mentioning the function itself with the missed optimizations.
With filters enabled, LLVM will not make as many of these allocations, and the ones it makes will be smaller. IIUC, the diagnostics start out small, and when the pass is enabled some additional details are sometimes appended.
In any case:
The asm-printer ones are the easiest: they're just optimization analysis diagnostics with the instruction count in a function, coming from here.
I'll point towards a couple of the missed-optimization diagnostics in SelectionDAG ISel; they all go through reportFastISelFailure (so the 4 callers will be enough) just before going through the LLVMContext::diagnose you linked above:
- the most common, "FastISel missed terminator"/"FastISel missed", from here; this is a case where some details are added on-demand.
- "FastISel didn't lower all arguments: ..." from here.
(Also fun: these passes seem to go through LegacyPassManager.cpp, which I wasn't expecting to still see :)
It feels like maybe we should land RespectFilters = true ourselves, and either open LLVM issues or PRs in the future. What do you think?
These won't impact max-rss under a good allocator since they're so short-lived, and the improvements in instruction counts locally on cachegrind are understandably small (and on the perfbot it's mostly within noise on syn/cargo).
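The "details added on-demand" pattern mentioned above can be sketched as follows (hypothetical names and signature, not the actual reportFastISelFailure API): the cheap base message always exists, but the extra detail string is only built and appended when remarks for the pass are actually enabled.

```rust
/// Sketch of the on-demand detail pattern: start from the small base
/// message, and only pay for appending the extra detail when some
/// consumer is actually listening for this pass's remarks.
fn fast_isel_message(base: &str, detail: Option<&str>, remarks_enabled: bool) -> String {
    let mut msg = String::from(base);
    if remarks_enabled {
        if let Some(d) = detail {
            msg.push_str(": ");
            msg.push_str(d);
        }
    }
    msg
}
```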
☀️ Try build successful - checks-actions
Finished benchmarking commit (86a42973b773876956b20455a549ebcb24dd9e6e): comparison URL.
Overall result: ✅ improvements - no action needed
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.
@bors rollup=never
Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: This benchmark run did not return any relevant results for this metric.
Binary size: This benchmark run did not return any relevant results for this metric.
Bootstrap: 653.846s -> 654.057s (0.03%)
@bors try @rust-timer queue
⌛ Trying commit 12c57024d5d15802c5b6e5a7b69a3835ed8f714b with merge 583a60dbbea5e6712b308f359519270f875975b0...
☀️ Try build successful - checks-actions
Finished benchmarking commit (583a60dbbea5e6712b308f359519270f875975b0): comparison URL.
Overall result: ✅ improvements - no action needed
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.
@bors rollup=never
Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary size: This benchmark run did not return any relevant results for this metric.
Bootstrap: 658.53s -> 657.986s (-0.08%)
@tmiasko I've removed the filtering done on the rust side, since that's now unused. r? tmiasko
this will eliminate many short-lived allocations (e.g. 20% of the memory used building cargo) made when unpacking the diagnostic and converting its various C++ strings into rust strings, only for most of them to be filtered out.
now that remarks are filtered before cg_llvm's diagnostic handler callback is called, we don't need to do the filtering after the C++-to-rust conversion of the diagnostic.
@bors r+
☀️ Test successful - checks-actions
Filter out short-lived LLVM diagnostics before they reach the rustc handler

During profiling I saw remark passes being unconditionally enabled: for example `Machine Optimization Remark Emitter`. The diagnostic remarks enabled by default are [from missed optimizations and opt analyses](rust-lang/rust#113339 (comment)). They are created by LLVM, passed to the diagnostic handler on the C++ side, emitted to rust, where they are unpacked, C++ strings are converted to rust, etc. Then they are discarded the vast majority of the time (i.e. unless some kind of `-Cremark` has enabled some of these passes' output to be printed).

These unneeded allocations are very short-lived, basically only lasting between the LLVM pass emitting them and the rust handler where they are discarded. So this doesn't hugely impact max-rss, and is only a slight reduction in instruction count (cachegrind reports a reduction between 0.3% and 0.5%) _on linux_. It's possible that targets without `jemalloc`, or with a worse allocator, may optimize these less.

It is however significant in the aggregate, looking at the total number of allocated bytes:

- it's the biggest source of allocations according to dhat, on the benchmarks I've tried, e.g. `syn` or `cargo`
- allocations on `syn` are reduced by 440MB, 17% (from 2440722647 bytes total, to 2030461328 bytes)
- allocations on `cargo` are reduced by 6.6GB, 19% (from 35371886402 bytes total, to 28723987743 bytes)

Some of these diagnostics objects [are allocated in LLVM](rust-lang/rust#113339 (comment)) *before* they're emitted to our diagnostic handler, where they'll be filtered out. So we could remove those in the future, but that will require changing a few LLVM call-sites upstream, so I left a FIXME.
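The change described above amounts to moving the filter in front of the string conversions. A rough Rust model of the before/after pipelines (names hypothetical; each remark is modeled as a `(pass, message)` pair, and each `to_string` stands in for a C++-to-rust string conversion):

```rust
/// Eager pipeline (before): convert every remark's message into an
/// owned String, then let the rust handler discard the filtered ones.
fn conversions_before(remarks: &[(&str, &str)], _enabled: &[&str]) -> usize {
    remarks
        .iter()
        .map(|(_pass, msg)| msg.to_string()) // one allocation per remark
        .count()
}

/// Filtered pipeline (after): check the pass-name filter first, and
/// only convert (allocate) the remarks that will actually be printed.
fn conversions_after(remarks: &[(&str, &str)], enabled: &[&str]) -> usize {
    remarks
        .iter()
        .filter(|(pass, _msg)| enabled.contains(pass))
        .map(|(_pass, msg)| msg.to_string())
        .count()
}
```

With no `-Cremark`-style filters enabled, the filtered pipeline performs zero conversions, which is the common case this PR optimizes.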
Finished benchmarking commit (f77c624): comparison URL.
Overall result: ❌✅ regressions and improvements - ACTION NEEDED
Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression
Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): This benchmark run did not return any relevant results for this metric.
Cycles: This benchmark run did not return any relevant results for this metric.
Binary size: This benchmark run did not return any relevant results for this metric.
Bootstrap: missing data
The regressions are in 3 doc benchmarks that look a bit noisy right now, and this PR wouldn't impact rustdoc. @rustbot label: +perf-regression-triaged |