MaybeUninit::assume_init optimizes poorly #74267

dtolnay · 2020-07-12T12:44:36Z

In #74254 we observed that returning expr.assume_init() from a function unexpectedly prevents the return value from being constructed in place in the caller-provided return slot.

https://rust.godbolt.org/z/hr77qM

#![allow(deprecated)]

use std::mem::{self, MaybeUninit};
use std::ptr;

type T = String;
const N: usize = 2;

// fast
pub fn a() -> [T; N] {
    Default::default()
}

// fast
pub fn b() -> [T; N] {
    unsafe {
        // ignore the UB for now
        let mut array: [T; N] = mem::uninitialized();
        for slot in &mut array {
            ptr::write(slot, T::default());
        }
        array
    }
}

// slow
pub fn c() -> [T; N] {
    let mut array: MaybeUninit<[T; N]> = MaybeUninit::uninit();
    unsafe {
        // ignore the UB for now
        // ordinarily would cast to &mut [MaybeUninit<T>; N]
        // but here we try to minimize difference from `b`
        let slots = &mut *array.as_mut_ptr();
        for slot in slots {
            ptr::write(slot, T::default());
        }
        array.assume_init()
    }
}

Notice that in the slow function the return value is constructed exactly the same as in both of the fast functions (6 instructions) except in the wrong place, then relocated from [rsp-48] to [rdi] 😢 (12 instructions).

example::a:
        mov     rax, rdi
        mov     qword ptr [rdi], 1
        vxorps  xmm0, xmm0, xmm0
        vmovups xmmword ptr [rdi + 8], xmm0
        mov     qword ptr [rdi + 24], 1
        vmovups xmmword ptr [rdi + 32], xmm0
        ret

example::b:
        mov     rax, rdi
        mov     qword ptr [rdi], 1
        vxorps  xmm0, xmm0, xmm0
        vmovups xmmword ptr [rdi + 8], xmm0
        mov     qword ptr [rdi + 24], 1
        vmovups xmmword ptr [rdi + 32], xmm0
        ret

example::c:
        sub     rsp, 48
        mov     rax, rdi
        mov     qword ptr [rsp], 1
        vxorps  xmm0, xmm0, xmm0
        vmovups xmmword ptr [rsp + 8], xmm0
        mov     qword ptr [rsp + 24], 1
        vmovups xmmword ptr [rsp + 32], xmm0
        mov     rcx, qword ptr [rsp]
        mov     qword ptr [rdi], rcx
        vmovups xmm0, xmmword ptr [rsp + 8]
        vmovups xmmword ptr [rdi + 8], xmm0
        mov     rcx, qword ptr [rsp + 24]
        mov     qword ptr [rdi + 24], rcx
        mov     rcx, qword ptr [rsp + 16]
        mov     qword ptr [rdi + 16], rcx
        mov     rcx, qword ptr [rsp + 24]
        mov     qword ptr [rdi + 24], rcx
        vmovups xmm0, xmmword ptr [rsp + 32]
        vmovups xmmword ptr [rdi + 32], xmm0
        add     rsp, 48
        ret
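
For reference, the comment in `c` mentions that one would ordinarily work with `[MaybeUninit<T>; N]` instead. A minimal sketch of that UB-free formulation follows, assuming the same `T` and `N` as above; the function name `d` is hypothetical and not part of the original report, and no claim is made here about the assembly it produces.

```rust
use std::mem::{self, MaybeUninit};

type T = String;
const N: usize = 2;

// Hypothetical name `d`: returns the same value as `a`, `b`, and `c`, without the UB.
pub fn d() -> [T; N] {
    // An array of `MaybeUninit<T>` needs no initialization itself,
    // so calling `assume_init` on the outer wrapper is sound.
    let mut slots: [MaybeUninit<T>; N] = unsafe { MaybeUninit::uninit().assume_init() };
    for slot in &mut slots {
        // `MaybeUninit::write` stores the value without dropping old contents.
        slot.write(T::default());
    }
    // Every slot is initialized, and both array types have the same layout.
    unsafe { mem::transmute::<[MaybeUninit<T>; N], [T; N]>(slots) }
}
```

If `T::default()` panicked midway, the elements already written would be leaked but not dropped twice, so the sketch stays sound.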

dtolnay · 2020-07-12T12:55:44Z

The 12 instruction copy is extra baffling but it's possible that part is outside of Rust's control.

    ; copy word 0
        mov     rcx, qword ptr [rsp]
        mov     qword ptr [rdi], rcx
    ; copy words 1-2
        vmovups xmm0, xmmword ptr [rsp + 8]
        vmovups xmmword ptr [rdi + 8], xmm0
    ; copy word 3
        mov     rcx, qword ptr [rsp + 24]
        mov     qword ptr [rdi + 24], rcx
    ; copy word 2 again (??)
        mov     rcx, qword ptr [rsp + 16]
        mov     qword ptr [rdi + 16], rcx
    ; copy word 3 again (??)
        mov     rcx, qword ptr [rsp + 24]
        mov     qword ptr [rdi + 24], rcx
    ; copy words 4-5
        vmovups xmm0, xmmword ptr [rsp + 32]
        vmovups xmmword ptr [rdi + 32], xmm0

I might have expected something like this:

        vmovups ymm0, ymmword ptr [rsp]
        vmovups ymm1, ymmword ptr [rsp + 16]
        vmovups ymmword ptr [rdi], ymm0
        vmovups ymmword ptr [rdi + 16], ymm1

which is what we get from fn cpy(input: [usize; 6]) -> [usize; 6] { input }. https://rust.godbolt.org/z/P5PzYs

RalfJung · 2020-07-12T13:07:44Z

Is this something we can even fix on the Rust side, or would we expect LLVM to handle this better?

nikic · 2020-07-12T15:19:48Z

@dtolnay The ymm movups get broken up to avoid x86 store forwarding stalls. Of course, that shouldn't be introducing duplicate copies of the same bytes...

alex · 2020-08-09T18:08:21Z

Is there any way to express the semantics of assume_init in C code? That's my traditional next step in trying to understand LLVM optimization differences.

RalfJung · 2020-08-09T18:36:22Z

It's basically an identity function -- a function that takes an array of some fixed size and returns the same array.
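
In Rust terms, that description can be sketched as follows. The names `identity` and `round_trip` are hypothetical, and the `[usize; 6]` type is borrowed from the `cpy` example earlier in the thread; this is an illustration of the semantics only, not of how the standard library implements `assume_init`.

```rust
use std::mem::MaybeUninit;

// Hypothetical helper: takes a fixed-size array and returns the same array.
pub fn identity(input: [usize; 6]) -> [usize; 6] {
    input
}

// Wraps the array in `MaybeUninit` and unwraps it again.
// Semantically this does the same thing as `identity`.
pub fn round_trip(input: [usize; 6]) -> [usize; 6] {
    let wrapped = MaybeUninit::new(input);
    // Sound: `wrapped` was built from a fully initialized value.
    unsafe { wrapped.assume_init() }
}
```

An optimizer that treats both functions alike should be able to elide the intermediate copy in either case.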

alex · 2020-08-09T20:20:55Z

https://godbolt.org/z/5eKxE5 looks like it's a roughly correct C rendition of this problem. And it reproduces. I'm afraid part of the answer here is "aliasing", but adding -Z mutable-noalias=yes to the Rust reproducer didn't solve it.

tesuji · 2020-08-10T00:21:00Z

GCC seems to optimize better than LLVM: https://godbolt.org/z/hWEWvM

b:
        mov     QWORD PTR [rdi], 0
        mov     QWORD PTR [rdi+8], 0
        mov     QWORD PTR [rdi+16], 0
        mov     QWORD PTR [rdi+24], 0
        mov     QWORD PTR [rdi+32], 0
        mov     QWORD PTR [rdi+40], 0
        mov     rax, rdi
        ret
c:
        mov     QWORD PTR [rdi], 0
        mov     QWORD PTR [rdi+8], 0
        mov     QWORD PTR [rdi+16], 0
        mov     QWORD PTR [rdi+24], 0
        mov     QWORD PTR [rdi+32], 0
        mov     QWORD PTR [rdi+40], 0
        mov     rax, rdi
        ret

alex · 2020-08-11T12:21:03Z

I've filed an LLVM bug for this: https://bugs.llvm.org/show_bug.cgi?id=47114

RalfJung · 2021-03-13T08:59:37Z

@nikic do you know if the LLVM upgrade helped with this as well?

nikic · 2021-03-13T09:27:49Z

@RalfJung Yes, this now seems to optimize well, presumably due to additional SROA after fully unrolling the initialization loop. We only start seeing memcpys at N=26 and higher, probably because some arbitrary cutoff is reached. I would expect that there would still be an unnecessary memcpy if the loop didn't get unrolled, but I didn't manage to make LLVM not unroll it...
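
To probe the cutoff described above, the array length can be made a const generic parameter. This is a minimal sketch under stated assumptions: the helper `init` and the wrappers `init_25` and `init_26` are hypothetical names that do not appear in the thread, and the sketch writes through raw element pointers so that it stays sound. It makes no claim about what any particular compiler version emits.

```rust
use std::mem::MaybeUninit;

// Hypothetical helper, not from the thread: builds `[String; N]` inside a
// `MaybeUninit` buffer and returns it through `assume_init`.
pub fn init<const N: usize>() -> [String; N] {
    let mut array: MaybeUninit<[String; N]> = MaybeUninit::uninit();
    // Pointer to the first element; no reference to uninitialized data is created.
    let base = array.as_mut_ptr().cast::<String>();
    for i in 0..N {
        // In bounds because `i < N`; `write` does not drop the previous contents.
        unsafe { base.add(i).write(String::new()) };
    }
    // Sound: all `N` elements were written above.
    unsafe { array.assume_init() }
}

// Monomorphic wrappers on either side of the reported N=26 threshold,
// convenient for comparing the generated assembly.
pub fn init_25() -> [String; 25] {
    init::<25>()
}

pub fn init_26() -> [String; 26] {
    init::<26>()
}
```

Compiling both wrappers with optimizations enabled and comparing their output is one way to check whether a memcpy appears for a given toolchain.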

dtolnay · 2021-08-21T20:37:17Z

Closing because the godbolt link from the top of this issue (https://rust.godbolt.org/z/hr77qM) now produces effectively identical asm for all 3 functions.

RalfJung · 2021-08-22T16:57:20Z

Is there a test for this? Should we have one?

dtolnay added the I-slow (problems and improvements with respect to performance of generated code), A-codegen (code generation), and T-compiler (relevant to the compiler team, which will review and decide on the PR/issue) labels Jul 12, 2020

MikailBag mentioned this issue Aug 9, 2020 in "Use const generics for array Default impl" #61415 (open)

lcnr mentioned this issue Aug 13, 2020 in "Default for arrays via const generics" #74254 (closed)

dtolnay closed this as completed Aug 21, 2021

RalfJung added the E-needs-test label (call for participation: an issue has been fixed and does not reproduce, but no test has been added) Aug 30, 2021

RalfJung reopened this Aug 30, 2021

elichai mentioned this issue Jul 4, 2022 in "Should we use MaybeUnint?" rust-bitcoin/rust-secp256k1#469 (open)
