
Filter out short-lived LLVM diagnostics before they reach the rustc handler #113339

Merged
merged 3 commits into rust-lang:master from the respect-filters branch on Aug 2, 2023

Conversation

lqd
Member

@lqd lqd commented Jul 4, 2023

During profiling I saw remark passes being unconditionally enabled: for example Machine Optimization Remark Emitter.

The diagnostic remarks enabled by default are from missed optimizations and opt analyses. They are created by LLVM, passed to the diagnostic handler on the C++ side, and emitted to rust, where they are unpacked, their C++ strings converted to rust strings, etc.

Then they are discarded the vast majority of the time (i.e. unless some kind of -Cremark flag has enabled some of these passes' output to be printed).

These unneeded allocations are very short-lived, basically only lasting between the LLVM pass emitting them and the rust handler where they are discarded. So this doesn't hugely impact max-rss, and is only a slight reduction in instruction count (cachegrind reports a reduction between 0.3% and 0.5%) on linux. It's possible that targets without jemalloc, or with a worse allocator, optimize these less well.

It is however significant in the aggregate, looking at the total number of allocated bytes:

  • it's the biggest source of allocations according to dhat, on the benchmarks I've tried e.g. syn or cargo
  • allocations on syn are reduced by 440MB, 17% (from 2440722647 bytes total, to 2030461328 bytes)
  • allocations on cargo are reduced by 6.6GB, 19% (from 35371886402 bytes total, to 28723987743 bytes)

Some of these diagnostic objects are allocated in LLVM before they're emitted to our diagnostic handler, where they'll be filtered out. So we could remove those allocations in the future, but that will require changing a few LLVM call sites upstream, so I left a FIXME.
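To make the per-remark cost concrete, here is a minimal sketch of the kind of unpacking work being skipped. This is not the actual rustc llvm-wrapper code; `UnpackedRemark` and `unpackOptimizationRemark` are illustrative names. Each remark that reaches the handler has its pass name and message rendered into freshly allocated owned strings, only to be discarded when no -Cremark filter matches.

```cpp
// Hedged sketch, not the actual rustc llvm-wrapper code: `UnpackedRemark`
// and `unpackOptimizationRemark` are illustrative names. This shows the
// kind of short-lived allocations the PR avoids: rendering a remark's pass
// name and message into owned strings that are usually dropped right away.
#include "llvm/ADT/StringRef.h"
#include "llvm/IR/DiagnosticInfo.h"
#include <string>

struct UnpackedRemark {
  std::string PassName;
  std::string Message;
};

static UnpackedRemark
unpackOptimizationRemark(const llvm::DiagnosticInfoOptimizationBase &D) {
  UnpackedRemark R;
  R.PassName = D.getPassName().str(); // StringRef -> freshly allocated string
  R.Message = D.getMsg();             // concatenates the remark's arguments
  return R; // in the unfiltered path, this is discarded almost immediately
}
```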

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jul 4, 2023
@lqd
Member Author

lqd commented Jul 4, 2023

@bors try @rust-timer queue


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jul 4, 2023
@bors
Contributor

bors commented Jul 4, 2023

⌛ Trying commit 1a8ebe489d2077d188ad02bc2af7455136b39829 with merge 86a42973b773876956b20455a549ebcb24dd9e6e...

@@ -1995,7 +1995,7 @@ extern "C" void LLVMRustContextConfigureDiagnosticHandler(
       std::move(RemarkFile),
       std::move(RemarkStreamer),
       std::move(LlvmRemarkStreamer)
-  ));
+  ), true);
Contributor

The filtering enabled here takes place after the remarks have already been constructed:

https://github.com/rust-lang/llvm-project/blob/22897bce7bfedc9cd3953a33419b346936263500/llvm/lib/IR/LLVMContext.cpp#L253-L257

It would be even better to identify which remarks those are and emit them conditionally, like all the others.
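For context, the ordering in LLVMContext::diagnose looks roughly like this (a paraphrased sketch, not verbatim LLVM source; `ContextSketch` is an illustrative stand-in for LLVMContext): the DiagnosticInfo has already been constructed by the reporting pass before this point, so a RespectFilters check here only saves the handler call and the C++-to-rust conversion, not the remark's own allocation.

```cpp
// Paraphrased sketch of the LLVMContext::diagnose ordering (not verbatim
// LLVM source; `ContextSketch` is an illustrative stand-in for LLVMContext).
#include "llvm/IR/DiagnosticHandler.h"
#include "llvm/IR/DiagnosticInfo.h"
#include <memory>

struct ContextSketch {
  std::unique_ptr<llvm::DiagnosticHandler> DiagHandler;
  bool RespectDiagnosticFilters = false;

  // Stand-in for LLVMContext::isDiagnosticEnabled, which consults the
  // handler's is*RemarkEnabled predicates for optimization remarks.
  bool isDiagnosticEnabled(const llvm::DiagnosticInfo &DI) const {
    (void)DI;
    return true; // placeholder body for the sketch
  }

  void diagnose(const llvm::DiagnosticInfo &DI) {
    // By the time we get here, the remark (pass name, message arguments, ...)
    // has already been allocated by the pass that reported it.
    if (DiagHandler &&
        (!RespectDiagnosticFilters || isDiagnosticEnabled(DI)) &&
        DiagHandler->handleDiagnostics(DI))
      return;
    // ... otherwise LLVM falls back to its default printing ...
  }
};
```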

Member Author

@lqd lqd Jul 4, 2023

Yeah, I stopped here after seeing that our diagnostics handler was no longer called with optimization diagnostics, to check whether changes to the PGO diagnostics would be impactful.

This is a great point, thank you, I will look into that.

Member Author

Here's some of what I found.

On debug builds for helloworld, diesel, exa, cargo, and syn together (the latter two generate a lot of optimization diagnostics, 6-7 GB in total on cargo for example, which RespectFilters = true eliminates), there were 330k diagnostics total.

They come from 2 passes: "sdagisel", the more common one (260k, 80%), and "asm-printer" (72k, 20%).

Here's the top 60% to give you an idea:

336578 counts
(  1)   166274 (49.4%, 49.4%): optimization diagnostic received, kind: OptimizationMissed, pass: "sdagisel", message: "FastISel missed terminator"
(  2)    23695 ( 7.0%, 56.4%): optimization diagnostic received, kind: OptimizationMissed, pass: "sdagisel", message: "FastISel didn't lower all arguments: ptr"
(  3)     7822 ( 2.3%, 58.8%): optimization diagnostic received, kind: OptimizationMissed, pass: "sdagisel", message: "FastISel missed"
(  4)     3962 ( 1.2%, 59.9%): optimization diagnostic received, kind: OptimizationAnalysis, pass: "asm-printer", message: "8 instructions in function"
...

The rest are all in the <=1% range, mostly unique fastisel messages that mention the specific function with the missed optimization.

With filters enabled, LLVM does fewer of these allocations, and smaller ones: IIUC the diagnostics start out small, and some additional details are only appended when the pass is enabled.

In any case:

The asm-printer ones are the easiest: they're just optimization analysis diagnostics with the instruction count of a function, coming from here.

I'll point towards a couple of the missed-optimization diagnostics in selection dag isel; they go through reportFastISelFailure (so the 4 callers will be enough) just before reaching the LLVMContext::diagnose you linked above:

  • the most common, "FastISel missed terminator"/"FastISel missed", comes from here; this is a case where some details are added on demand.
  • "FastISel didn't lower all arguments: ..." from here.

(also fun: these passes seem to go through LegacyPassManager.cpp, which I wasn't expecting to still see :)

It feels like maybe we should land RespectFilters = true ourselves, and either open LLVM issues or PRs in the future. What do you think?

These won't impact max-rss under a good allocator since they're so short-lived, and the improvements in instruction counts locally under cachegrind are understandably small (on the perf bot it's mostly within noise on syn/cargo).
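For reference, here is a hedged sketch of the lazy remark-emission pattern many LLVM passes already use, which the call sites above could adopt upstream. The function name and remark wording are illustrative, not the actual SelectionDAGISel code: the remark is only constructed inside the callback, and OptimizationRemarkEmitter::emit invokes that callback only when remarks are enabled for the pass.

```cpp
// Hedged sketch of LLVM's lazy remark-emission pattern; illustrative names,
// not the actual SelectionDAGISel/AsmPrinter call sites.
#include "llvm/Analysis/OptimizationRemarkEmitter.h"
#include "llvm/IR/DiagnosticInfo.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instruction.h"

void reportMissedLowering(llvm::OptimizationRemarkEmitter &ORE,
                          const llvm::Function &F,
                          const llvm::Instruction *Inst) {
  ORE.emit([&]() {
    // The remark (and its string arguments) is only built if remarks are
    // enabled for this pass, so disabled builds allocate nothing here.
    return llvm::OptimizationRemarkMissed("sdagisel", "FastISelFailure", Inst)
           << "FastISel missed lowering in " << F.getName();
  });
}
```

By contrast, the FastISel failure remarks above are constructed eagerly inside reportFastISelFailure before reaching LLVMContext::diagnose, which is why changing those call sites is left as an upstream FIXME.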

@bors
Contributor

bors commented Jul 4, 2023

☀️ Try build successful - checks-actions
Build commit: 86a42973b773876956b20455a549ebcb24dd9e6e


@rust-timer
Collaborator

Finished benchmarking commit (86a42973b773876956b20455a549ebcb24dd9e6e): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

                             mean    range             count
Regressions ❌ (primary)     -       -                 0
Regressions ❌ (secondary)   -       -                 0
Improvements ✅ (primary)    -0.6%   [-0.7%, -0.6%]    2
Improvements ✅ (secondary)  -       -                 0
All ❌✅ (primary)           -0.6%   [-0.7%, -0.6%]    2

Max RSS (memory usage)


This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                             mean    range             count
Regressions ❌ (primary)     -       -                 0
Regressions ❌ (secondary)   -       -                 0
Improvements ✅ (primary)    -       -                 0
Improvements ✅ (secondary)  -3.2%   [-3.2%, -3.2%]    1
All ❌✅ (primary)           -       -                 0

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 653.846s -> 654.057s (0.03%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jul 5, 2023
@lqd lqd force-pushed the respect-filters branch from 1a8ebe4 to 12c5702 on July 18, 2023 10:06
@lqd
Member Author

lqd commented Jul 18, 2023

@bors try @rust-timer queue


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jul 18, 2023
@bors
Contributor

bors commented Jul 18, 2023

⌛ Trying commit 12c57024d5d15802c5b6e5a7b69a3835ed8f714b with merge 583a60dbbea5e6712b308f359519270f875975b0...

@bors
Contributor

bors commented Jul 18, 2023

☀️ Try build successful - checks-actions
Build commit: 583a60dbbea5e6712b308f359519270f875975b0


@lqd lqd changed the title from "test remark passes filtering on the c++ side" to "Filter out short-lived LLVM diagnostics before they reach the rustc handler" Jul 18, 2023
@rust-timer
Collaborator

Finished benchmarking commit (583a60dbbea5e6712b308f359519270f875975b0): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

                             mean    range             count
Regressions ❌ (primary)     -       -                 0
Regressions ❌ (secondary)   -       -                 0
Improvements ✅ (primary)    -0.5%   [-0.5%, -0.5%]    2
Improvements ✅ (secondary)  -1.0%   [-2.2%, -0.6%]    5
All ❌✅ (primary)           -0.5%   [-0.5%, -0.5%]    2

Max RSS (memory usage)


This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                             mean    range             count
Regressions ❌ (primary)     -       -                 0
Regressions ❌ (secondary)   -       -                 0
Improvements ✅ (primary)    -1.6%   [-2.2%, -1.0%]    2
Improvements ✅ (secondary)  -       -                 0
All ❌✅ (primary)           -1.6%   [-2.2%, -1.0%]    2

Cycles


This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                             mean    range             count
Regressions ❌ (primary)     1.4%    [1.4%, 1.4%]      1
Regressions ❌ (secondary)   -       -                 0
Improvements ✅ (primary)    -       -                 0
Improvements ✅ (secondary)  -       -                 0
All ❌✅ (primary)           1.4%    [1.4%, 1.4%]      1

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 658.53s -> 657.986s (-0.08%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jul 18, 2023
@lqd lqd marked this pull request as ready for review July 18, 2023 15:36
@lqd
Member Author

lqd commented Jul 18, 2023

@tmiasko I've removed RespectFilters = true as we discussed, and instead do the filtering in our RustDiagnosticHandler when a pass or diagnostic kind is disabled.

I've also removed the filtering done on the rust side, since that's now unused.

r? tmiasko
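For illustration, a minimal sketch of that handler-side filtering, assuming a hypothetical EnabledPasses set populated from -Cremark (this is not the actual RustDiagnosticHandler in rustc's llvm-wrapper): LLVM consults these predicates, so remarks for disabled passes are dropped before any C++-to-rust conversion, and lazy emitters never construct them at all.

```cpp
// Minimal sketch of handler-side remark filtering (not the actual
// RustDiagnosticHandler in rustc's llvm-wrapper; `EnabledPasses` is an
// illustrative placeholder for whatever -Cremark requested).
#include "llvm/ADT/StringRef.h"
#include "llvm/IR/DiagnosticHandler.h"
#include "llvm/IR/DiagnosticInfo.h"
#include <set>
#include <string>
#include <utility>

class FilteringDiagnosticHandler : public llvm::DiagnosticHandler {
  std::set<std::string> EnabledPasses; // passes requested via -Cremark

  bool isEnabled(llvm::StringRef PassName) const {
    return EnabledPasses.count(PassName.str()) != 0;
  }

public:
  explicit FilteringDiagnosticHandler(std::set<std::string> Passes)
      : EnabledPasses(std::move(Passes)) {}

  // LLVM consults these predicates, so remarks for disabled passes never
  // reach handleDiagnostics (and lazy emitters never build them).
  bool isAnalysisRemarkEnabled(llvm::StringRef PassName) const override {
    return isEnabled(PassName);
  }
  bool isMissedOptRemarkEnabled(llvm::StringRef PassName) const override {
    return isEnabled(PassName);
  }
  bool isPassedOptRemarkEnabled(llvm::StringRef PassName) const override {
    return isEnabled(PassName);
  }
  bool isAnyRemarkEnabled() const override { return !EnabledPasses.empty(); }

  bool handleDiagnostics(const llvm::DiagnosticInfo &DI) override {
    // The rust-side callback would be invoked here (omitted in this sketch).
    (void)DI;
    return true; // handled
  }
};
```

Returning true from handleDiagnostics tells LLVM the diagnostic was consumed, so it doesn't fall back to printing it itself.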

lqd added 2 commits August 1, 2023 21:01
  • this will eliminate many short-lived allocations (e.g. 20% of the memory used building cargo) when unpacking the diagnostic and converting its various C++ strings into rust strings, just to be filtered out most of the time.
  • now that remarks are filtered before cg_llvm's diagnostic handler callback is called, we don't need to do the filtering post C++-to-rust conversion of the diagnostic.
@lqd lqd force-pushed the respect-filters branch from 12c5702 to ca5a383 on August 1, 2023 21:05
@tmiasko
Contributor

tmiasko commented Aug 1, 2023

@bors r+

@bors
Contributor

bors commented Aug 1, 2023

📌 Commit ca5a383 has been approved by tmiasko

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Aug 1, 2023
@bors
Contributor

bors commented Aug 1, 2023

⌛ Testing commit ca5a383 with merge f77c624...

@bors
Contributor

bors commented Aug 2, 2023

☀️ Test successful - checks-actions
Approved by: tmiasko
Pushing f77c624 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Aug 2, 2023
@bors bors merged commit f77c624 into rust-lang:master Aug 2, 2023
@rustbot rustbot added this to the 1.73.0 milestone Aug 2, 2023
github-actions bot pushed a commit to rust-lang/miri that referenced this pull request Aug 2, 2023
@lqd lqd deleted the respect-filters branch August 2, 2023 06:05
@rust-timer
Collaborator

Finished benchmarking commit (f77c624): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression
cc @rust-lang/wg-compiler-performance

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

                             mean    range             count
Regressions ❌ (primary)     0.3%    [0.3%, 0.3%]      1
Regressions ❌ (secondary)   0.4%    [0.3%, 0.5%]      2
Improvements ✅ (primary)    -0.4%   [-0.6%, -0.2%]    3
Improvements ✅ (secondary)  -       -                 0
All ❌✅ (primary)           -0.2%   [-0.6%, 0.3%]     4

Max RSS (memory usage)

This benchmark run did not return any relevant results for this metric.

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: missing data

@rustbot rustbot added the perf-regression Performance regression. label Aug 2, 2023
@lqd
Member Author

lqd commented Aug 2, 2023

The regressions are in 3 doc benchmarks that look a bit noisy right now, and this PR wouldn't impact rustdoc.

[screenshot omitted: perf graph of the noisy doc benchmarks]

@rustbot label: +perf-regression-triaged

@rustbot rustbot added the perf-regression-triaged The performance regression has been triaged. label Aug 2, 2023
lnicola pushed a commit to lnicola/rust-analyzer that referenced this pull request Apr 7, 2024
RalfJung pushed a commit to RalfJung/rust-analyzer that referenced this pull request Apr 27, 2024