Async performance regression #70488
Comments
Can reproduce. The regression comes from changing the discriminant type from u32 to u8 (or whatever fits best, which is almost always u8); exposing the niche does not regress performance much. Isn't x86 generally less efficient when dealing with <32-bit words? In that case we might want to consider doing this optimization conditionally, since it can be a somewhat significant RAM saving on smaller microcontroller targets. I could imagine using |
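To make the size trade-off concrete, here is a minimal sketch that uses plain enums with explicit `repr` attributes as stand-ins for the generator discriminant. The enum and function names are hypothetical; rustc picks the real discriminant type internally and it is never written out like this.

```rust
use std::mem::{align_of, size_of};

// Hypothetical stand-in for a generator state tag stored as a single byte.
#[allow(dead_code)]
#[repr(u8)]
pub enum StateU8 {
    Unresumed,
    Returned,
    Suspend0,
}

// The same hypothetical tag, stored as a 32-bit integer.
#[allow(dead_code)]
#[repr(u32)]
pub enum StateU32 {
    Unresumed,
    Returned,
    Suspend0,
}

/// Returns (size, alignment) in bytes for a type.
pub fn layout_of<T>() -> (usize, usize) {
    (size_of::<T>(), align_of::<T>())
}
```

The narrower tag saves three bytes per state machine, and it also lowers the minimum alignment the tag imposes on the enclosing type from 4 to 1.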
I think the slowdown is likely some sort of bug in our codegen or LLVM perhaps? This is a single iteration on nightly (unrolled 8 times in the actual benchmark code):
0.00 : 18a260: mov BYTE PTR [rsp+0x8],0x0
0.80 : 18a265: movzx ecx,BYTE PTR [rsp]
0.99 : 18a269: movzx ecx,BYTE PTR [rsp+0x1]
1.09 : 18a26e: movzx ecx,BYTE PTR [rsp+0x2]
0.87 : 18a273: movzx ecx,BYTE PTR [rsp+0x3]
0.97 : 18a278: movzx ecx,BYTE PTR [rsp+0x4]
0.83 : 18a27d: movzx ecx,BYTE PTR [rsp+0x5]
1.32 : 18a282: movzx ecx,BYTE PTR [rsp+0x6]
0.87 : 18a287: movzx ecx,BYTE PTR [rsp+0x7]
0.84 : 18a28c: movzx ecx,BYTE PTR [rsp+0x9]
0.86 : 18a291: movzx ecx,BYTE PTR [rsp+0xa]
1.01 : 18a296: movzx ecx,BYTE PTR [rsp+0xb]
1.04 : 18a29b: movzx ecx,BYTE PTR [rsp+0x8]
On the good commit, it looks like this (and is only unrolled 4 times):
17.76 : 183f62: mov DWORD PTR [rsp],0x0
0.00 : 183f69: mov edx,DWORD PTR [rsp+0x4]
3.26 : 183f6d: mov edx,DWORD PTR [rsp+0x8]
0.00 : 183f71: mov edx,DWORD PTR [rsp]
Trying to reduce this down, I came up with this example:
use std::future::Future;
pub async fn foo(v: i32) -> i32 { v }
macro_rules! wrap {
    ($v:expr) => {
        async move { ($v).await }
    }
}
pub fn bar(v: i32) -> impl Future<Output = i32> {
    wrap!(wrap!(wrap!(wrap!(wrap!(wrap!(foo(v)))))))
}
In godbolt, the nightly has an offset of 52 -- and that increases by 8 each additional
Nightly:
example::foo:
mov eax, edi
ret
example::bar:
mov rax, rdi
mov dword ptr [rdi], esi
mov byte ptr [rdi + 52], 0
ret
and on stable:
example::foo:
mov eax, edi
ret
example::bar:
mov rax, rdi
mov dword ptr [rdi], esi
mov dword ptr [rdi + 4], 0
ret |
Thanks, that's helpful! |
The dword vs. byte moves can be explained by the alignment changing from 4 (due to using |
Not entirely certain, but I think this is just automatic field reordering doing its thing. The discriminant is now a u8 with alignment 1, so it gets moved to the very end of the memory layout to not waste space with padding. Previously it was a u32 with alignment 4, just like all the other fields in the generator, so reordering anything there doesn't make sense. If I take your example and replace So far this looks like a fairly expected outcome of my PR; it's pretty well-known that smaller alignment can generate less efficient code (OTOH apparently modern x86 chips can do unaligned moves just as fast as aligned ones, so maybe the byte-wise moves should not be emitted, but in the end this is completely up to LLVM). Since, AFAICT, it should only affect futures that do not store any data with alignment in them, the impact should be pretty minimal in practice, so we can also consider closing this without a fix. (this does make me wish that we had better layout debugging capabilities in rustc though) |
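As a rough stand-in for the layout debugging wished for above, the size and alignment of a future can be read with `std::mem` without polling it. This is only a sketch: `future_layout` and `report` are hypothetical helper names, and the values they return depend on the compiler version, so no specific numbers are claimed here.

```rust
use std::future::Future;
use std::mem::{align_of_val, size_of_val};

pub async fn foo(v: i32) -> i32 {
    v
}

/// Returns (size, alignment) in bytes of a future value without polling it.
pub fn future_layout<F: Future>(fut: &F) -> (usize, usize) {
    (size_of_val(fut), align_of_val(fut))
}

/// Layout of a plain `async fn` future and of the same future wrapped
/// in one extra `async move` block, in that order.
pub fn report() -> [(usize, usize); 2] {
    let plain = foo(1);
    let wrapped = async move { foo(1).await };
    [future_layout(&plain), future_layout(&wrapped)]
}
```

Comparing the two entries across toolchains shows whether an extra wrapping level changes the size or the alignment of the resulting state machine.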
A |
Ah, nice to see that land, that's certainly a step in the right direction |
@jonas-schievink a quick test to see if this is indeed an alignment problem is to stick a |
That does not seem to do anything, hmm... Maybe I'm doing it wrong though. Would be useful to get an MCVE that exhibits the inefficient codegen. |
I've just checked the benchmark code, and it does not actually benchmark anything except copying the futures around. They are never polled because they are never spawned onto an executor. The benchmark results would make sense here if the alignment is really the cause. |
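For contrast, a benchmark that actually drives a future has to poll it. The sketch below is a deliberately tiny busy-polling driver built only on `std`; `poll_to_completion` and `noop_raw_waker` are hypothetical names, not part of the benchmark crate or of Criterion, and a real executor would park instead of spinning.

```rust
use std::future::Future;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A waker whose callbacks do nothing; enough for futures that never wait.
fn noop_raw_waker() -> RawWaker {
    fn clone(_: *const ()) -> RawWaker {
        noop_raw_waker()
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    RawWaker::new(std::ptr::null(), &VTABLE)
}

/// Polls `fut` in a loop until it completes and returns its output.
pub fn poll_to_completion<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    // SAFETY: the vtable functions ignore the data pointer and are thread-safe no-ops.
    let waker = unsafe { Waker::from_raw(noop_raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

pub async fn ready<T>(t: T) -> T {
    t
}
```

Calling `poll_to_completion(ready(42))` inside the measured closure would exercise the state machine itself rather than only the cost of moving the unpolled future around.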
How so?
We got |
Enabling Criterion's With the feature enabled the benchmarks now take <1ns, so yeah, they were really just benchmarking the memcpy in |
@kpp |
Ooops. Never felt more embarrassed. |
Jonas, does this sound better to you?
|
I'd wrap the |
|
use std::{mem, ptr};
use std::future::Future;
async fn ready<T>(t: T) -> T {
    t
}
fn black_box<T>(dummy: T) -> T {
    unsafe {
        let ret = std::ptr::read_volatile(&dummy);
        std::mem::forget(dummy);
        ret
    }
}
#[inline(never)]
fn iter<O, R>(mut routine: R)
where
    R: FnMut() -> O,
{
    loop {
        black_box(routine());
    }
}
pub unsafe fn ready_bench() {
    iter(move || async {
        black_box(ready(42)).await
    });
}
On stable:
On nightly:
|
Hm, I guess this is not really a bug then.
…On Sun., Mar. 29, 2020, 13:45 Jonas Schievink, ***@***.***> wrote:
Reproduction
<https://play.rust-lang.org/?version=nightly&mode=release&edition=2018&gist=fc0db5e71c27b16c010db95c9053b6ac>
:
On stable:
playground::iter:
sub rsp, 16
.LBB0_1:
mov dword ptr [rsp], 0
mov eax, dword ptr [rsp + 4]
mov eax, dword ptr [rsp + 8]
mov eax, dword ptr [rsp]
jmp .LBB0_1
playground::ready_bench:
push rax
call playground::iter
ud2
On nightly:
playground::iter: # @playground::iter
# %bb.0:
sub rsp, 16
.LBB0_1: # =>This Inner Loop Header: Depth=1
mov byte ptr [rsp + 8], 0
movzx eax, byte ptr [rsp]
movzx eax, byte ptr [rsp + 1]
movzx eax, byte ptr [rsp + 2]
movzx eax, byte ptr [rsp + 3]
movzx eax, byte ptr [rsp + 4]
movzx eax, byte ptr [rsp + 5]
movzx eax, byte ptr [rsp + 6]
movzx eax, byte ptr [rsp + 7]
movzx eax, byte ptr [rsp + 9]
movzx eax, byte ptr [rsp + 10]
movzx eax, byte ptr [rsp + 11]
movzx eax, byte ptr [rsp + 8]
jmp .LBB0_1
# -- End function
playground::ready_bench: # @playground::ready_bench
# %bb.0:
push rax
call playground::iter
ud2
# -- End function
|
The actual LLVM type did change to have an alignment of 1:
; stable
%"std::future::GenFuture<ready_bench::{{closure}}::{{closure}}>" = type { [0 x i32], %"ready_bench::{{closure}}::{{closure}}", [0 x i32] }
%"ready_bench::{{closure}}::{{closure}}" = type { [0 x i32], i32, [2 x i32] }
; nightly
%"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>" = type { [0 x i32], %"ready_bench::{{closure}}::{{closure}}", [0 x i32] }
%"ready_bench::{{closure}}::{{closure}}" = type { [8 x i8], i8, [3 x i8] }
One odd thing is that the volatile load in the LLVM IR is passed
; playground::iter
; Function Attrs: noinline noreturn nounwind nonlazybind uwtable
define internal fastcc void @_ZN10playground4iter17h1130617e37f08519E() unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
%_3 = alloca %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>", align 8
%_3.0.sroa_cast = bitcast %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>"* %_3 to i8*
%_3.8.sroa_idx = getelementptr inbounds %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>", %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>"* %_3, i64 0, i32 1, i32 1
br label %bb1
bb1: ; preds = %bb1, %start
call void @llvm.lifetime.start.p0i8(i64 12, i8* nonnull %_3.0.sroa_cast)
store i8 0, i8* %_3.8.sroa_idx, align 8, !alias.scope !2
%_3.0. = load volatile %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>", %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>"* %_3, align 8, !alias.scope !8, !noalias !11
call void @llvm.lifetime.end.p0i8(i64 12, i8* nonnull %_3.0.sroa_cast)
br label %bb1
} |
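The two LLVM struct types above can be mirrored with `#[repr(C)]` structs to see the alignment difference from the Rust side. This is an illustrative sketch only: the struct and field names are hypothetical, and the real generator layout is chosen by rustc rather than declared like this.

```rust
use std::mem::{align_of, size_of};

// Mirrors the stable type `{ i32, [2 x i32] }`: 32-bit state first, then data.
#[repr(C)]
pub struct WordLayout {
    pub state: u32,
    pub data: [u32; 2],
}

// Mirrors the nightly type `{ [8 x i8], i8, [3 x i8] }`: one state byte at offset 8.
#[repr(C)]
pub struct ByteLayout {
    pub data: [u8; 8],
    pub state: u8,
    pub pad: [u8; 3],
}

/// Returns (size, alignment) for the word-based and byte-based layouts.
pub fn layouts() -> [(usize, usize); 2] {
    [
        (size_of::<WordLayout>(), align_of::<WordLayout>()),
        (size_of::<ByteLayout>(), align_of::<ByteLayout>()),
    ]
}
```

Both layouts occupy 12 bytes, matching the `i64 12` passed to `llvm.lifetime.start` in the IR; only the natural alignment of the type differs (4 versus 1).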
Cf: http://www.idryman.org/blog/2012/11/21/integer-promotion/ In C, this potential cause of performance regression cannot happen because of integer promotion. If so, couldn't Rust support integer promotion, at the very least as an optional flag to enforce performance guarantees? |
The PR that caused this intentionally did the opposite of integer promotion (narrowing a u32 to a u8), just not in a way that's visible to users. Generally rustc/LLVM is free to do integer promotion or narrowing whenever it sees fit, as long as it doesn't change the observed behavior of the program. I haven't completely nailed down this issue yet as I'm not an LLVM expert, but it seems like it's caused by LLVM using only the LLVM type for alignment decisions, not the explicit alignment of the |
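For readers coming from C: at the source level Rust never promotes integers implicitly, so widening is always spelled out. A minimal sketch, with hypothetical function names, of what that looks like in user code (this is separate from whatever widening or narrowing the compiler does internally):

```rust
/// Adds two bytes in `u8` arithmetic; returns `None` if the sum overflows.
pub fn add_narrow(a: u8, b: u8) -> Option<u8> {
    a.checked_add(b)
}

/// Widens both operands explicitly to `u32` before adding, so no overflow occurs.
pub fn add_widened(a: u8, b: u8) -> u32 {
    u32::from(a) + u32::from(b)
}
```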
I have no idea if this is really relevant to the conversation but LLVM is introducing a new alignment type that could solve some issues http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html |
We discussed this in the wg-async-foundations meeting and several people wanted to close as won't fix. Without evidence that this causes performance regressions in real-world code, it may not be worth pursuing further. We also don't think it's worth making generators bigger or changing their alignment to improve a microbenchmark. That said, it does seem like LLVM could be generating better assembly, so pinging those folks for more input: @rustbot ping llvm |
Hey LLVM ICE-breakers! This bug has been identified as a good cc @comex @cuviper @DutchGhost @hanna-kruppe @hdhoang @heyrutvik @JOE1994 @jryans @mmilenko @nagisa @nikic @Noah-Kennedy @SiavoshZarrasvand @spastorino @vertexclique @vgxbj |
For the "reproduction", the issue isn't the alignment; it's the way the example reimplements
Longer explanation: LLVM implements volatile loads/stores of structs as volatile loads/stores of their fields; if the struct consists of N fields, it will always emit N separate load/store instructions. (Without volatile it will optimize them into something nicer.) Note that LLVM does this regardless of the size of struct, as long as the LLVM IR has a load of struct type (
Regardless, the result is unpredictable, nonintuitive, and undocumented, in both Rust and C. Luckily, nobody actually uses volatile loads/stores of entire structs in cases where it matters, like accessing hardware MMIO registers: the registers may or may not be represented in a struct, but the actual loads and stores are done to individual fields. |
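The hand-written `black_box` in the reproduction performs a volatile read of the whole value, which is exactly the whole-struct case described here. Below is a hedged sketch of the two alternatives the explanation points toward: volatile access to an individual field, and the standard library's `std::hint::black_box` for benchmarking. The `Registers` struct and both function names are hypothetical.

```rust
use std::hint::black_box;
use std::ptr;

// Hypothetical register block; a real MMIO layout would come from a datasheet.
#[repr(C)]
pub struct Registers {
    pub status: u32,
    pub data: u32,
}

/// Volatile read of a single field instead of the whole struct.
pub fn read_status(regs: &Registers) -> u32 {
    // SAFETY: `regs.status` is a valid, aligned, initialized `u32` behind a live reference.
    unsafe { ptr::read_volatile(&regs.status) }
}

/// Hides a value from the optimizer without a volatile struct load.
pub fn opaque<T>(value: T) -> T {
    black_box(value)
}
```

Reading one field at a time keeps each access a single load of a primitive type, so the number of emitted instructions no longer depends on how the struct type is spelled in LLVM IR.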
Ah, I see, that explains the emitted code, thanks! It also explains what was observed in #66595 I suppose. Well, I don't think there's much of a point in leaving this issue open then, so closing. |
To sum up.
|
See #69033 (comment); @kpp reports a 200%+ regression on some futures microbenchmarks, e.g.:
The benchmarks are at https://github.com/kpp/futures-async-combinators/.
I bisected this to 57e1da5, which most likely implicates #69837.