Filter out short-lived LLVM diagnostics before they reach the rustc handler #113339
Conversation
@bors try @rust-timer queue
⌛ Trying commit 1a8ebe489d2077d188ad02bc2af7455136b39829 with merge 86a42973b773876956b20455a549ebcb24dd9e6e...
@@ -1995,7 +1995,7 @@ extern "C" void LLVMRustContextConfigureDiagnosticHandler(
        std::move(RemarkFile),
        std::move(RemarkStreamer),
        std::move(LlvmRemarkStreamer)
-   ));
+   ), true);
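The extra `true` argument presumably corresponds to the `RespectFilters` behavior discussed in this thread: the handler checks a remark's pass name against the configured filters before forwarding it. A minimal Rust sketch of that idea (all names hypothetical, not the actual LLVM or rustc API):

```rust
/// Hypothetical model of a diagnostic handler with remark filtering.
/// With `respect_filters` enabled, a remark's pass name is checked
/// against the configured filters before the remark is forwarded,
/// instead of every remark reaching the downstream callback.
struct DiagnosticHandler {
    respect_filters: bool,
    enabled_passes: Vec<String>, // e.g. passes enabled via -Cremark
}

impl DiagnosticHandler {
    fn is_remark_enabled(&self, pass: &str) -> bool {
        if !self.respect_filters {
            // Old behavior: everything is forwarded, and must be
            // filtered (after allocation) on the rust side.
            return true;
        }
        self.enabled_passes.iter().any(|p| p == pass)
    }
}
```

With `respect_filters: false`, every remark reaches the callback; with it enabled, only remarks from passes in `enabled_passes` do.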
The filtering enabled here takes place after the remarks have already been constructed:
It would be even better to identify which remarks those are and emit them conditionally, like all the others.
Yeah, I stopped here after seeing that our diagnostics handler was no longer being called with optimization diagnostics, to check whether changes to the PGO diagnostics would be impactful.
This is a great point, thank you; I will look into that.
Here's some of what I found.
On debug builds for helloworld, diesel, exa, cargo, and syn together (the latter two generate a lot of optimization diagnostics; we're talking 6-7 GB in total on cargo alone, for example, which RespectFilters = true eliminates), there were 330k diagnostics in total.
They come from 2 passes: "sdagisel", the more common (260k, 80%), and "asm-printer" next (72k, 20%).
Here's the top 60% to give you an idea:
336578 counts
( 1) 166274 (49.4%, 49.4%): optimization diagnostic received, kind: OptimizationMissed, pass: "sdagisel", message: "FastISel missed terminator"
( 2) 23695 ( 7.0%, 56.4%): optimization diagnostic received, kind: OptimizationMissed, pass: "sdagisel", message: "FastISel didn't lower all arguments: ptr"
( 3) 7822 ( 2.3%, 58.8%): optimization diagnostic received, kind: OptimizationMissed, pass: "sdagisel", message: "FastISel missed"
( 4) 3962 ( 1.2%, 59.9%): optimization diagnostic received, kind: OptimizationAnalysis, pass: "asm-printer", message: "8 instructions in function"
...
The rest are all around the <=1% range, with unique FastISel messages mentioning the function itself with the missed optimizations.
With filters enabled, LLVM will not make as many of these allocations, and the ones it makes will be smaller. IIUC, the diagnostics start out small, and when the pass is enabled some additional details are sometimes appended.
In any case:
The asm-printer ones are the easiest: they're just optimization analysis diagnostics with the instruction count in a function, coming from here.
I'll point towards a couple of the missed-optimization diagnostics in SelectionDAG ISel; they all go through reportFastISelFailure (so the 4 callers will be enough) just before going through the LLVMContext::diagnose you linked above:
- the most common, "FastISel missed terminator"/"FastISel missed", from here; this is a case where some details are added on-demand.
- "FastISel didn't lower all arguments: ..." from here.
(Also fun: these passes seem to go through LegacyPassManager.cpp, which I wasn't expecting to still see :)
It feels like maybe we should land RespectFilters = true ourselves, and either open LLVM issues or PRs in the future. What do you think?
These won't impact max-rss under a good allocator since they're so short-lived, and the improvements in instruction counts locally on cachegrind are understandably small (and on the perfbot it's mostly within noise on syn/cargo).
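The "details added on-demand" pattern mentioned above can be sketched as follows (hypothetical names and signature, not the actual reportFastISelFailure API): the cheap base message always exists, but the extra detail string is only built and appended when remarks for the pass are actually enabled.

```rust
/// Sketch of the on-demand detail pattern: start from the small base
/// message, and only pay for appending the extra detail when some
/// consumer is actually listening for this pass's remarks.
fn fast_isel_message(base: &str, detail: Option<&str>, remarks_enabled: bool) -> String {
    let mut msg = String::from(base);
    if remarks_enabled {
        if let Some(d) = detail {
            msg.push_str(": ");
            msg.push_str(d);
        }
    }
    msg
}
```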
☀️ Try build successful - checks-actions
Finished benchmarking commit (86a42973b773876956b20455a549ebcb24dd9e6e): comparison URL.
Overall result: ✅ improvements - no action needed
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.
@bors rollup=never
Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: This benchmark run did not return any relevant results for this metric.
Binary size: This benchmark run did not return any relevant results for this metric.
Bootstrap: 653.846s -> 654.057s (0.03%)
@bors try @rust-timer queue
⌛ Trying commit 12c57024d5d15802c5b6e5a7b69a3835ed8f714b with merge 583a60dbbea5e6712b308f359519270f875975b0...
☀️ Try build successful - checks-actions
Finished benchmarking commit (583a60dbbea5e6712b308f359519270f875975b0): comparison URL.
Overall result: ✅ improvements - no action needed
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.
@bors rollup=never
Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary size: This benchmark run did not return any relevant results for this metric.
Bootstrap: 658.53s -> 657.986s (-0.08%)
@tmiasko I've removed the filtering done on the rust side, since that's now unused. r? tmiasko
this will eliminate many short-lived allocations (e.g. 20% of the memory used building cargo) made when unpacking the diagnostic and converting its various C++ strings into rust strings, only for most of them to be filtered out.
now that remarks are filtered before cg_llvm's diagnostic handler callback is called, we don't need to do the filtering after the C++-to-rust conversion of the diagnostic.
@bors r+
☀️ Test successful - checks-actions
Filter out short-lived LLVM diagnostics before they reach the rustc handler

During profiling I saw remark passes being unconditionally enabled: for example `Machine Optimization Remark Emitter`. The diagnostic remarks enabled by default are [from missed optimizations and opt analyses](rust-lang/rust#113339 (comment)). They are created by LLVM, passed to the diagnostic handler on the C++ side, emitted to rust, where they are unpacked, C++ strings are converted to rust, etc. Then they are discarded the vast majority of the time (i.e. unless some kind of `-Cremark` has enabled some of these passes' output to be printed).

These unneeded allocations are very short-lived, basically only lasting between the LLVM pass emitting them and the rust handler where they are discarded. So this doesn't hugely impact max-rss, and is only a slight reduction in instruction count (cachegrind reports a reduction between 0.3% and 0.5%) _on linux_. It's possible that targets without `jemalloc`, or with a worse allocator, may optimize these less.

It is however significant in the aggregate, looking at the total number of allocated bytes:

- it's the biggest source of allocations according to dhat, on the benchmarks I've tried, e.g. `syn` or `cargo`
- allocations on `syn` are reduced by 440MB, 17% (from 2440722647 bytes total, to 2030461328 bytes)
- allocations on `cargo` are reduced by 6.6GB, 19% (from 35371886402 bytes total, to 28723987743 bytes)

Some of these diagnostics objects [are allocated in LLVM](rust-lang/rust#113339 (comment)) *before* they're emitted to our diagnostic handler, where they'll be filtered out. So we could remove those in the future, but that will require changing a few LLVM call-sites upstream, so I left a FIXME.
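The change described above amounts to moving the filter in front of the string conversions. A rough Rust model of the before/after pipelines (names hypothetical; each remark is modeled as a `(pass, message)` pair, and each `to_string` stands in for a C++-to-rust string conversion):

```rust
/// Eager pipeline (before): convert every remark's message into an
/// owned String, then let the rust handler discard the filtered ones.
fn conversions_before(remarks: &[(&str, &str)], _enabled: &[&str]) -> usize {
    remarks
        .iter()
        .map(|(_pass, msg)| msg.to_string()) // one allocation per remark
        .count()
}

/// Filtered pipeline (after): check the pass-name filter first, and
/// only convert (allocate) the remarks that will actually be printed.
fn conversions_after(remarks: &[(&str, &str)], enabled: &[&str]) -> usize {
    remarks
        .iter()
        .filter(|(pass, _msg)| enabled.contains(pass))
        .map(|(_pass, msg)| msg.to_string())
        .count()
}
```

With no `-Cremark`-style filters enabled, the filtered pipeline performs zero conversions, which is the common case this PR optimizes.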
Finished benchmarking commit (f77c624): comparison URL.
Overall result: ❌✅ regressions and improvements - ACTION NEEDED
Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression
Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): This benchmark run did not return any relevant results for this metric.
Cycles: This benchmark run did not return any relevant results for this metric.
Binary size: This benchmark run did not return any relevant results for this metric.
Bootstrap: missing data
The regressions are in 3 doc benchmarks that look a bit noisy right now, and this PR wouldn't impact rustdoc. @rustbot label: +perf-regression-triaged |