
Severe perf regression in optimized debug builds due to extra UB checks #121245

Closed
RalfJung opened this issue Feb 17, 2024 · 17 comments
Assignees
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code. P-high High priority regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@RalfJung
Member

RalfJung commented Feb 17, 2024

In Miri built with optimized and debug assertions, we have recently seen a 50% increase in CI times. Bisecting points at #120594, and @saethlin confirmed that disabling one of these UB checks gives a speedup of the same ballpark as that slowdown.

Several things could be done here:

  • We could just remove this particular check.
  • We could have some way to disable the checks in a particular scope.
  • We could decouple "UB-checking assertions" from the general debug_assertions cfg-flag, similar to how overflow checks already have their own flag.
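The overflow-checks precedent the third option refers to can be seen on stable today. A minimal sketch (the behavior shown is stable Rust; a corresponding dedicated UB-checks flag did not exist at the time of this comment):

```rust
// Overflow checks are already controlled independently of
// `debug_assertions`: building with `-C debug-assertions=off
// -C overflow-checks=on` panics on overflow even though the
// `debug_assertions` cfg is false. A dedicated UB-checks flag
// would give the same independence to the new assertions.
fn main() {
    println!("debug_assertions enabled: {}", cfg!(debug_assertions));

    // `checked_add` surfaces the overflow explicitly, regardless of flags:
    let x: u8 = 255;
    println!("255u8.checked_add(1) = {:?}", x.checked_add(1)); // None

    // `wrapping_add` never panics, independent of overflow-checks:
    println!("255u8.wrapping_add(1) = {}", x.wrapping_add(1)); // 0
}
```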

Update (2024-02-25): CI times are back to normal, now only one specific benchmark is affected.

@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Feb 17, 2024
@RalfJung RalfJung added I-slow Issue: Problems and improvements with respect to performance of generated code. regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. labels Feb 17, 2024
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Feb 17, 2024
@jieyouxu jieyouxu added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Feb 17, 2024
@saethlin saethlin changed the title Severe perf regression in optimized debug builds due to extrac UB checks Severe perf regression in optimized debug builds due to extra UB checks Feb 17, 2024
@apiraino
Contributor

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-high

@rustbot rustbot added P-high High priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Feb 19, 2024
@clarfonthey
Contributor

Just want to inject a few thoughts here as I work on #117494, which (as mentioned in that PR) will probably exacerbate this issue.

First, I noticed that there seem to be a few cases where the duplicated checks in checked methods are not getting optimized out, as demonstrated in my analysis in this comment. This could be a factor in the performance issue, either by making the optimiser work harder to remove them (increasing compile times) or by the checks being run twice because the optimiser failed to remove them. To clarify, I mean cases like this not being optimised properly:

fn checked_thing(...) {
    if condition(...) {
        unsafe { unchecked_thing(...) }
    } else {
        // ...
    }
}

fn unchecked_thing(...) {
    debug_assert_nounwind!(condition(...), ...);
    // ...
}

Note that in some cases, the two conditions might be logically equivalent but not exactly the same code. (In the example of unchecked_shX, the two conditions were rhs < Self::BITS and rhs & (Self::BITS - 1) < Self::BITS.)
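To make the schematic pattern above concrete, here is a hypothetical self-contained instance (`checked_div`/`unchecked_div` are illustrative stand-ins, not the actual standard-library internals, and plain `debug_assert!` stands in for `debug_assert_nounwind!`):

```rust
// The safe wrapper checks the precondition, then the unchecked
// function debug-asserts the same condition again. If the optimizer
// cannot prove the two checks redundant, the condition runs twice.
fn checked_div(lhs: u32, rhs: u32) -> Option<u32> {
    if rhs != 0 {
        // SAFETY: we just checked that `rhs` is nonzero.
        Some(unsafe { unchecked_div(lhs, rhs) })
    } else {
        None
    }
}

/// # Safety
/// `rhs` must be nonzero.
unsafe fn unchecked_div(lhs: u32, rhs: u32) -> u32 {
    // The duplicated check the comment above is talking about.
    debug_assert!(rhs != 0);
    lhs / rhs
}

fn main() {
    println!("{:?}", checked_div(10, 3)); // Some(3)
    println!("{:?}", checked_div(10, 0)); // None
}
```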


Second, I want to voice support for giving these unsafe asserts their own dedicated flag that can be disabled, like debug_assertions but technically separate from it. If we additionally offered this to downstream crate authors, it could be a useful tool to enable for debugging when something goes wrong, and to disable when it puts too much stress on testing (like here). Additionally, making it a compiler flag means you can enable it sometimes and disable it other times, e.g. ensure it's on when unit testing methods that contain unsafe code but not when testing the safe methods that call them.

@RalfJung
Member Author

This could be a potential factor in the performance issue, either by making the optimiser work harder to remove them (increasing compile times) or by them being run twice because the optimiser failed to remove them.

This issue is about the second point: performance of rustc-generated code, not compile times.

Additionally, making it a compiler flag means you can enable it sometimes and disable it sometimes, e.g. ensure it's on when unit testing methods that contain unsafe code but not when testing the safe methods that call them.

How do you set compiler flags on a per-test basis...?

@clarfonthey
Contributor

Additionally, making it a compiler flag means you can enable it sometimes and disable it sometimes, e.g. ensure it's on when unit testing methods that contain unsafe code but not when testing the safe methods that call them.

How do you set compiler flags on a per-test basis...?

I mostly meant having them enabled when testing libraries but not in the larger compiler tests, since those generally have separate test suites. But also, the compiler tests themselves tend to set compiler flags when they run, so you could configure them there.

Basically, not within a single crate, but for multi-crate projects like the compiler and for compiler tests specifically.

@Noratrieb
Member

Noratrieb commented Feb 24, 2024

#121114 will fix or at least significantly help with this. The checks are currently wayyy slower than they need to be because they are outlined for compile time reasons.

@RalfJung
Member Author

@Nilstrieb you linked this issue to itself? I assume you mean #121114?

@Noratrieb
Member

Noratrieb commented Feb 24, 2024

yes, i don't think the issue will fix itself

@RalfJung
Member Author

RalfJung commented Feb 25, 2024

With #121114, CI times are indeed back to normal. 🎉

However, some individual benchmarks still show a severe regression. Specifically, CARGO_EXTRA_FLAGS="--profile=dev" ./miri bench slice-get-unchecked:

before (Miri 5f25162abf547fa3d306d2bdd7277f7a034fce5f):

Benchmark 1: cargo miri run --profile=dev --manifest-path "bench-cargo-miri/slice-get-unchecked/Cargo.toml"
  Time (mean ± σ):      3.810 s ±  0.030 s    [User: 3.758 s, System: 0.048 s]
  Range (min … max):    3.785 s …  3.858 s    5 runs

today:

Benchmark 1: cargo miri run --profile=dev --manifest-path "bench-cargo-miri/slice-get-unchecked/Cargo.toml"
  Time (mean ± σ):      5.766 s ±  0.006 s    [User: 5.723 s, System: 0.040 s]
  Range (min … max):    5.757 s …  5.773 s    5 runs

That's a 50% slowdown. Would be interesting to know if there's one particular check causing this or if it's everywhere. Interestingly, two other benchmarks I tried are completely unaffected, so it seems likely that this is one particular check.

@saethlin
Member

CARGO_EXTRA_FLAGS="--profile=dev" ./miri bench slice-get-unchecked:

slice-get-unchecked is a Stack cache stress-test. This benchmark scenario with debug assertions enabled now spends 88% of its runtime validating the Stacked Borrows cache. And using perf to jump into that function, there are no hot precondition checks. In addition, I've been able to induce swings between -1 and +3 seconds of runtime by deleting parts of that function.

The hot part of verify_cache_consistency (95% of samples in that function) is this loop:

for (idx, item) in self.borrows.iter().enumerate() {
    if item.perm() == Permission::Unique {
        assert!(
            self.unique_range.contains(&idx),
            "{:?} {:?}",
            self.unique_range,
            self.borrows
        );
    }
}
which even with debug assertions is compiled to what looks to me like the ideal sequence of instructions:

 16.00 │100:┌─→inc    %r10
 15.93 │    │  add    $0xfffffffffffffff8,%r9
 30.41 │    │↓ je     17d
 16.90 │109:├──testb  $0x60,0x7(%r8,%r10,8)
 15.96 │    └──jne    100

I cannot explain why adding checks around this loop is making the program faster or slower. I tried reducing the number of checks by doing a let borrows = self.borrows.as_slice() and then replacing all mentions of self.borrows with borrows to deduplicate the Vec::deref checks. That made the benchmark a tiny bit faster, except that I missed the use in the formatting of the assert!; when I replaced that use of self.borrows as well, the benchmark got slower. I also tried deleting the helpful formatting in the assert! of the hot loop, and that too made the benchmark slower.

My only guess is that this benchmark is exceedingly sensitive to code alignment. Or something like that. In any case, the fact that ~84% of the program's cycles are in a perfectly-optimized loop that doesn't contain any checks makes me doubt the causal link between the checks and the slowdown.
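The as_slice deduplication experiment mentioned above can be sketched as follows. This is a hypothetical, self-contained reconstruction with stand-in types, not Miri's actual definitions:

```rust
// Hoist `self.borrows.as_slice()` once so the loop body does not
// re-dereference the Vec (and re-check its metadata) on every access.
use std::ops::Range;

#[derive(Clone, Copy, PartialEq)]
enum Permission {
    Unique,
    SharedReadOnly,
}

struct Item {
    perm: Permission,
}

impl Item {
    fn perm(&self) -> Permission {
        self.perm
    }
}

struct Stack {
    borrows: Vec<Item>,
    unique_range: Range<usize>,
}

impl Stack {
    fn verify_cache_consistency(&self) {
        // One deref up front instead of one per mention of `self.borrows`.
        let borrows = self.borrows.as_slice();
        for (idx, item) in borrows.iter().enumerate() {
            if item.perm() == Permission::Unique {
                assert!(self.unique_range.contains(&idx), "bad cache");
            }
        }
    }
}

fn main() {
    let stack = Stack {
        borrows: vec![
            Item { perm: Permission::SharedReadOnly },
            Item { perm: Permission::Unique },
        ],
        unique_range: 1..2,
    };
    stack.verify_cache_consistency(); // passes: the Unique item is at idx 1
    println!("cache consistent");
}
```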

@RalfJung
Member Author

RalfJung commented Feb 26, 2024

~84% of the program's cycles are in a perfectly-optimized loop that doesn't contain any checks

How can that be true when the program gets 33% faster by removing the checks? Is that 84% of the cycles in the slow or the fast version of the program?

@RalfJung
Member Author

I tried deleting the helpful formatting in the assert! of the hot loop, and that also made the benchmark slower.

I can't reproduce this. I applied this diff on old Miri (before your checks were added)

diff --git a/src/borrow_tracker/stacked_borrows/stack.rs b/src/borrow_tracker/stacked_borrows/stack.rs
index 712c26a9..d107bd9a 100644
--- a/src/borrow_tracker/stacked_borrows/stack.rs
+++ b/src/borrow_tracker/stacked_borrows/stack.rs
@@ -161,9 +161,7 @@ impl<'tcx> Stack {
             if item.perm() == Permission::Unique {
                 assert!(
                     self.unique_range.contains(&idx),
-                    "{:?} {:?}",
-                    self.unique_range,
-                    self.borrows
+                    "bad cache",
                 );
             }
         }

Performance is the same with and without this.

@saethlin
Member

I ran

git checkout 5f25162abf547fa3d306d2bdd7277f7a034fce5f
./miri toolchain
CARGO_EXTRA_FLAGS="--profile=dev" ./miri bench slice-get-unchecked

And I get:

Benchmark 1: cargo miri run --profile=dev --manifest-path "bench-cargo-miri/slice-get-unchecked/Cargo.toml"
  Time (mean ± σ):      6.261 s ±  0.019 s    [User: 6.193 s, System: 0.058 s]
  Range (min … max):    6.239 s …  6.281 s    5 runs

With the same profile structure: ~84% of cycles are in the verify_cache_consistency loop that walks the entire stack.

But that's on my x86_64 desktop. On an M1 I see 6.8 seconds now and 5.6 seconds on commit 5f25162abf547fa3d306d2bdd7277f7a034fce5f. Still not as dramatic as the change you're seeing... What kind of machine are you on?

@RalfJung
Member Author

And I get:

And then when you apply just the diff I showed above it gets slower? Or what was your diff for "deleting the helpful formatting in the assert! of the hot loop"?

What kind of machine are you on?

A Lenovo laptop with an i7-12700H.

@saethlin
Member

Or what was your diff for "deleting the helpful formatting in the assert! of the hot loop"?

diff --git a/src/borrow_tracker/stacked_borrows/stack.rs b/src/borrow_tracker/stacked_borrows/stack.rs
index 712c26a9..d107bd9a 100644
--- a/src/borrow_tracker/stacked_borrows/stack.rs
+++ b/src/borrow_tracker/stacked_borrows/stack.rs
@@ -161,9 +161,7 @@ impl<'tcx> Stack {
             if item.perm() == Permission::Unique {
                 assert!(
                     self.unique_range.contains(&idx),
-                    "{:?} {:?}",
-                    self.unique_range,
-                    self.borrows
                 );
             }
         }

@RalfJung
Member Author

Yeah that diff makes no perf difference for me. If anything I am seeing a slight speedup (without: 3.806 s ± 0.014 s, with patch: 3.785 s ± 0.006 s).

@apiraino
Contributor

apiraino commented Mar 7, 2024

Assigning to Nils, who is authoring #121114, to signal that this issue is being taken care of. Thanks!

@rustbot assign @Nilstrieb

@RalfJung
Member Author

RalfJung commented Apr 7, 2024

Fixed by #123411 -- now one can disable the UB checks if they are too expensive. At least on nightly. Tracking issue: #123499.
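For reference, a command-line sketch of opting out (nightly-only; the exact flag spelling follows the tracking issue at the time of writing and may have changed since):

```shell
# Disable the standard library's UB-checking assertions while keeping
# debug assertions on. Requires a nightly toolchain.
RUSTFLAGS="-Zub-checks=no" cargo +nightly build --profile=dev
```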

@RalfJung RalfJung closed this as completed Apr 7, 2024