MaybeUninit::assume_init optimizes poorly #74267
Comments
The 12 instruction copy is extra baffling but it's possible that part is outside of Rust's control.
; copy word 0
mov rcx, qword ptr [rsp]
mov qword ptr [rdi], rcx
; copy words 1-2
vmovups xmm0, xmmword ptr [rsp + 8]
vmovups xmmword ptr [rdi + 8], xmm0
; copy word 3
mov rcx, qword ptr [rsp + 24]
mov qword ptr [rdi + 24], rcx
; copy word 2 again (??)
mov rcx, qword ptr [rsp + 16]
mov qword ptr [rdi + 16], rcx
; copy word 3 again (??)
mov rcx, qword ptr [rsp + 24]
mov qword ptr [rdi + 24], rcx
; copy words 4-5
vmovups xmm0, xmmword ptr [rsp + 32]
vmovups xmmword ptr [rdi + 32], xmm0
I might have expected something like this:
vmovups ymm0, ymmword ptr [rsp]
vmovups ymm1, ymmword ptr [rsp + 16]
vmovups ymmword ptr [rdi], ymm0
vmovups ymmword ptr [rdi + 16], ymm1
which is what we get from |
Is this something we can even fix on the Rust side, or would we expect LLVM to handle this better? |
@dtolnay The ymm movups get broken up to avoid x86 store forwarding stalls. Of course, that shouldn't be introducing duplicate copies of the same bytes... |
Is there any way to express the semantics of |
It's basically an identity function -- a function that takes an array of some fixed size and returns the same array. |
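To make the "identity function" framing concrete, here is a minimal Rust sketch. The names `Words`, `identity`, `assume` and `same_layout`, and the choice of `[usize; 6]` (48 bytes on a 64-bit target, matching the six-word copy in the asm above), are assumptions for illustration and are not taken from the godbolt links. `MaybeUninit<T>` is documented to have the same size and alignment as `T`, so both functions only have to hand the same bytes back.

```rust
use std::mem::{align_of, size_of, MaybeUninit};

// Hypothetical array type, chosen to match the 6-word copy in the asm.
pub type Words = [usize; 6];

/// Plain identity: takes an array by value and returns it unchanged.
pub fn identity(a: Words) -> Words {
    a
}

/// The same shape, but going through `MaybeUninit::assume_init`.
///
/// # Safety
/// The caller must have fully initialized `a`.
pub unsafe fn assume(a: MaybeUninit<Words>) -> Words {
    // SAFETY: guaranteed by the caller, see above.
    unsafe { a.assume_init() }
}

/// Checks that the wrapper adds no size or alignment over the plain array.
pub fn same_layout() -> bool {
    size_of::<MaybeUninit<Words>>() == size_of::<Words>()
        && align_of::<MaybeUninit<Words>>() == align_of::<Words>()
}
```

Since the two functions differ only in the type-level wrapper, one would hope for identical machine code from both.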
https://godbolt.org/z/5eKxE5 looks like it's a roughly correct C rendition of this problem. And it reproduces. I'm afraid part of the answer here is "aliasing", but adding |
GCC seems to optimize better than LLVM: https://godbolt.org/z/hWEWvM
b:
mov QWORD PTR [rdi], 0
mov QWORD PTR [rdi+8], 0
mov QWORD PTR [rdi+16], 0
mov QWORD PTR [rdi+24], 0
mov QWORD PTR [rdi+32], 0
mov QWORD PTR [rdi+40], 0
mov rax, rdi
ret
c:
mov QWORD PTR [rdi], 0
mov QWORD PTR [rdi+8], 0
mov QWORD PTR [rdi+16], 0
mov QWORD PTR [rdi+24], 0
mov QWORD PTR [rdi+32], 0
mov QWORD PTR [rdi+40], 0
mov rax, rdi
ret |
I've filed an LLVM bug for this: https://bugs.llvm.org/show_bug.cgi?id=47114 |
@nikic do you know if the LLVM upgrade helped with this as well? |
@RalfJung Yes, this now seems to optimize well, presumably due to additional SROA after fully unrolling the initialization loop. We only start seeing memcpys at N=26 and higher, probably because some arbitrary cutoff is reached. I would expect that there would still be an unnecessary memcpy if the loop didn't get unrolled, but I didn't manage to make LLVM not unroll it... |
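For anyone who wants to poke at the cutoff mentioned here, a const-generic sketch like the one below could be pasted into a compiler explorer. It is an assumption about how such an experiment might look, not the code used in this thread: `zeros`, `zeros_25` and `zeros_26` are hypothetical names, and the `usize` element type is a guess, so the N at which a memcpy appears may differ.

```rust
use std::mem::MaybeUninit;

/// Fills an `[usize; N]` element by element through a `MaybeUninit`
/// buffer, then returns it via `assume_init`.
pub fn zeros<const N: usize>() -> [usize; N] {
    let mut buf: MaybeUninit<[usize; N]> = MaybeUninit::uninit();
    let base = buf.as_mut_ptr() as *mut usize;
    for i in 0..N {
        // SAFETY: `i < N`, so the write stays inside `buf`.
        unsafe { base.add(i).write(0) };
    }
    // SAFETY: the loop above initialized all `N` elements.
    unsafe { buf.assume_init() }
}

// Monomorphic wrappers so each length gets its own symbol in the asm output.
pub fn zeros_25() -> [usize; 25] {
    zeros::<25>()
}

pub fn zeros_26() -> [usize; 26] {
    zeros::<26>()
}
```

Comparing the optimized output of the two wrappers should show whether the unrolling and SROA behavior described above applies to a given compiler version.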
Closing because the godbolt link from the top of this issue (https://rust.godbolt.org/z/hr77qM) now produces effectively identical asm for all 3 functions. |
Is there a test for this? Should we have one? |
In #74254 we observed that returning expr.assume_init() from a function unexpectedly inhibits the return value from being constructed in place up front.
https://rust.godbolt.org/z/hr77qM
Notice that in the slow function the return value is constructed exactly the same as in both of the fast functions (6 instructions) except in the wrong place, then relocated from [rsp-48] to [rdi] 😢 (12 instructions).
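The godbolt source is not reproduced in this thread, so the following is only a guess at the kind of code being compared. It is a minimal sketch assuming a six-word `[usize; N]` return value; `direct` and `via_assume_init` are hypothetical names. The first function builds the array in the ordinary way; the second initializes a `MaybeUninit` buffer and returns `assume_init()`, which is the pattern this issue reports as being constructed in the wrong place and then copied.

```rust
use std::mem::MaybeUninit;

// Hypothetical length: six 8-byte words, matching the 48-byte copy.
const N: usize = 6;

/// Ordinary construction, expected to be written straight into the
/// caller-provided return slot.
pub fn direct() -> [usize; N] {
    [0; N]
}

/// Element-wise initialization through `MaybeUninit`, returning
/// `assume_init()` (the shape reported as slow in this issue).
pub fn via_assume_init() -> [usize; N] {
    let mut buf: MaybeUninit<[usize; N]> = MaybeUninit::uninit();
    let base = buf.as_mut_ptr() as *mut usize;
    for i in 0..N {
        // SAFETY: `i < N`, so the pointer is in bounds of `buf`.
        unsafe { base.add(i).write(0) };
    }
    // SAFETY: every element was written by the loop above.
    unsafe { buf.assume_init() }
}
```

Both functions return the same value, so any difference in the generated code comes from how the return value is materialized rather than from what is computed.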