Use the 64b inner:monotonize() implementation not the 128b one for aarch64 #88651
Conversation
aarch64 prior to v8.4 (FEAT_LSE2) doesn't have an instruction that guarantees untorn 128b reads except for completing a 128b load/store exclusive pair (ldxp/stxp) or compare-and-swap (casp) successfully. The requirement to complete a 128b read+write atomic is actually more expensive and more unfair than the previous implementation of monotonize() which used a Mutex on aarch64, especially at large core counts. For aarch64 switch to the 64b atomic implementation which is about 13x faster for a benchmark that involves many calls to Instant::now().
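For context, a minimal sketch of the 64b strategy being switched to. This is an illustration only, not the actual `std::time` code (which additionally packs the timestamp relative to a base and handles rollover of the packed value):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative only: the real implementation stores a packed timestamp,
// not raw nanoseconds, and handles rollover of that packed value.
static LAST_NOW: AtomicU64 = AtomicU64::new(0);

/// Clamp `raw` so that the value returned never decreases across calls.
/// The whole operation works on a single 64-bit word, so aarch64 only
/// needs 64-bit exclusives (ldxr/stxr) or LSE atomics, never the
/// 128-bit ldxp/stxp pair that an untorn 16-byte read would require.
fn monotonize(raw: u64) -> u64 {
    match LAST_NOW.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |last| {
        if raw > last { Some(raw) } else { None }
    }) {
        Ok(_) => raw,      // we advanced the stored maximum
        Err(last) => last, // another thread already observed a later time
    }
}
```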
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @dtolnay (or someone else) soon. Please see the contribution instructions for more information.
(continuing discussion from #88652) I didn't expect any platform to use a read-write instruction to emulate 128-bit atomic loads. In that case, would it make sense to make this implementation conditional on the …
Unfortunately, no.
A little program that demonstrates the performance impact:
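The attached reproducer isn't shown here; a stand-in along these lines would exercise the same path (the thread and iteration counts are arbitrary, chosen for illustration):

```rust
use std::thread;
use std::time::Instant;

// Hypothetical stand-in for the attached reproducer: hammer Instant::now()
// from many threads so that the cost of monotonize() dominates.
fn main() {
    const THREADS: usize = 64;
    const ITERS: usize = 1_000_000;

    let start = Instant::now();
    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            thread::spawn(|| {
                // Keep the result live so the calls aren't optimized away.
                let mut latest = Instant::now();
                for _ in 0..ITERS {
                    latest = latest.max(Instant::now());
                }
                latest
            })
        })
        .collect();
    for handle in handles {
        let _ = handle.join().unwrap();
    }
    println!("{} calls took {:?}", THREADS * ITERS, start.elapsed());
}
```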
Any feedback?
@rustbot ping arm
Hey ARM Group! This bug has been identified as a good "ARM candidate". cc @adamgemmell @hug-dev @jacobbramley @JamieCunliffe @joaopaulocarreiro @raw-bin @Stammark
When you do that, can you also open an issue here linking to the upstream bugs, so we can update the cfg flags on the AtomicU128 version once they're fixed?
@the8472 Absolutely, but the condition will become more complicated, as we'll have to have both the architecture check and a target-flags check. I'm honestly not clear how we can detect if the target-feature we're passing in …

Assuming #88652 is merged, this will end up not being needed for aarch64/linux at least; other OSes will continue to benefit.
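A sketch of the kind of combined gating being discussed; the `lse2` target-feature string and the module layout here are assumptions of the sketch, not taken from the actual source:

```rust
// Hypothetical: pick the 64b implementation on aarch64 unless the build
// explicitly enables FEAT_LSE2 (e.g. via -C target-feature), in which
// case untorn 16-byte loads would be guaranteed by the architecture.
#[cfg(all(target_arch = "aarch64", not(target_feature = "lse2")))]
mod inner {
    pub fn which() -> &'static str {
        "64b AtomicU64 implementation"
    }
}

#[cfg(any(not(target_arch = "aarch64"), target_feature = "lse2"))]
mod inner {
    pub fn which() -> &'static str {
        "128b (or platform default) implementation"
    }
}

fn main() {
    println!("selected: {}", inner::which());
}
```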
I wonder what the performance is like on small systems (i.e. those with few cores). However, I think this PR is correct in the functional sense.

Investigating the LSE2 issue a bit, I've been reminded that, in general, we can't simply mix the legacy mutex- or ldxp/stxp-based atomics with LSE2's versions, at least not for the same objects. I think that LLVM won't generate the 8.4 sequence because doing so would make it ABI-incompatible with earlier code. Now, that might not actually be an issue here (e.g. if the field will never be exposed to such code), but that doesn't mean that LLVM provides a way to express that.

I'm not sure, but it may be difficult to use 8.4 LSE2 sequences without defining a new target, or requiring the use of Cargo's build-std. Our PAuth/BTI proposal (#88354) has a similar requirement.
The performance will be better in small systems too. Using the 64b version means that the load is single-copy atomic without the heavyweight exclusive pair having to complete. That should be good for all systems, big or small.
I don't think you mean LSE2 but perhaps just LSE. AFAIK, there isn't anything in the architecture that makes the code in this case incorrect; it's just perhaps non-performant.
I haven't found any evidence, looking through the LLVM source, that the v8.4 guarantee that 16B-aligned loads are single-copy atomic has been implemented. I also don't see how it's an ABI break: 16B loads continue to have to be naturally aligned, and they already have to be for the exclusive-pair instructions. Can you explain? It absolutely is an Arm architecture version problem, in that it is only guaranteed by the architecture in v8.4+ and people generally build generic v8.0+ code, but I just don't think this feature has been implemented in LLVM yet. Even if it had, it applies to so few of the Arm systems in the world that it's not worth agonizing over currently.
I must admit that I didn't look in detail at the 64-bit version, but I understood that it used a mutex to protect a standard (non-atomic) pair access, which isn't obviously better (though it could be, and I won't dispute it).
I'm still digging, to be honest, and some of my comment was a bit confused! I've seen some implementations use a locking implementation, but possibly individual loads and stores within the locks (rather than ldp/stp). How relevant they are, I don't know, especially as they'd be incompatible with a lock-free ldxp/stxp for the same reasons. Anyway, this isn't blocking this PR so let's continue this thread elsewhere (e.g. an issue or Zulip as you prefer). Ultimately, I would like to see the LSE2 sequences properly supported.
I've just been shown this: https://reviews.llvm.org/D109827
The 64b version doesn't use a mutex; it has slightly more code to handle rollover, but it's 13x faster precisely because it's not taking a mutex or emulating a single-copy-atomic load with an ldxp/stxp instruction pair, which is effectively the same cost as taking a mutex.
Awesome! However, it would still require dynamically detecting that the processor supports v8.4 and choosing the correct code path. Overall I doubt it's worth the effort, given that the additional code probably isn't any better than the current rollover code for quite a bit more complexity.
Sounds good, thanks! Please approve, @dtolnay.
Uh, I looked over the AtomicU64 implementation again and noticed a fairly egregious mistake. It's only using a load and a store, not an RMW atomic. That poisons any benchmark results, since it's too good to be true. I'll submit a fix.
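Roughly, the difference being described (a simplified illustration, not the actual code or the fix in #89017):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static LAST: AtomicU64 = AtomicU64::new(0);

// Racy: between the load and the store another thread may publish a
// larger value that this store then overwrites, so time can appear to
// go backwards. It also benchmarks unrealistically well because it is
// not a real read-modify-write.
fn monotonize_load_store(raw: u64) -> u64 {
    let last = LAST.load(Ordering::Relaxed);
    let new = raw.max(last);
    LAST.store(new, Ordering::Relaxed);
    new
}

// A single atomic read-modify-write instead: fetch_max returns the
// previous value, so the max of that and `raw` is the monotonized time.
fn monotonize_rmw(raw: u64) -> u64 {
    LAST.fetch_max(raw, Ordering::Relaxed).max(raw)
}
```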
You might want to redo your benchmarks on top of #89017 to make sure that the AtomicU64 approach still has a significant advantage over AtomicU128.
Thanks @the8472, this still looks to help 2-3x, but not the 13x it did previously. I'll take a closer look at the generated code early next week.
I've retested this, and yes, it continues to help to use the 64b version instead of the 128b version, by about 3.5x on the attached reproducer (down from 13x) with the fix from @the8472. Staring at the generated assembly, I can't say I understand why, except maybe that the critical section between the load/store exclusives is a lot smaller in the case of the 64b fetch-update than for the max operation that requires testing both 64b halves.
Thanks, all :)
@bors r+
📌 Commit ce450f8 has been approved by `dtolnay`
Hm, it's worth noting this PR recently made its way in as well, and the two might have some impact on each other: #83655
…arth Rollup of 12 pull requests

Successful merges:
- rust-lang#87631 (os current_exe using same approach as linux to get always the full ab…)
- rust-lang#88234 (rustdoc-json: Don't ignore impls for primitive types)
- rust-lang#88651 (Use the 64b inner:monotonize() implementation not the 128b one for aarch64)
- rust-lang#88816 (Rustdoc migrate to table so the gui can handle >2k constants)
- rust-lang#89244 (refactor: VecDeques PairSlices fields to private)
- rust-lang#89364 (rustdoc-json: Encode json files with UTF-8)
- rust-lang#89423 (Fix ICE caused by non_exaustive_omitted_patterns struct lint)
- rust-lang#89426 (bootstrap: add config option for nix patching)
- rust-lang#89462 (haiku thread affinity build fix)
- rust-lang#89482 (Follow the diagnostic output style guide)
- rust-lang#89504 (Don't suggest replacing region with 'static in NLL)
- rust-lang#89535 (fix busted JavaScript in error index generator)

Failed merges:

r? `@ghost` `@rustbot` modify labels: rollup