
Vec::push should load a passed-by-pointer argument *after* reserve_for_push #115242

Open · scottmcm opened this issue Aug 26, 2023 · 8 comments
Labels: I-slow, T-compiler, T-libs

@scottmcm (Member)

(Inspired by @erikdesjardins's example in #115212 (comment).)

When pushing a [u8; 16] (which is passed indirectly):

pub fn push(vec: &mut Vec<[u8; 16]>, data: [u8; 16]) {
    vec.push(data);
}

On today's nightly we load data into an xmm register, spill it to the stack in order to call reserve_for_push, then load it again: https://godbolt.org/z/4jTK4r3Ko

example::push:
        push    rbx
        sub     rsp, 16
        mov     rbx, rdi
        movups  xmm0, xmmword ptr [rsi]
        mov     rsi, qword ptr [rdi + 16]
        cmp     rsi, qword ptr [rdi + 8]
        jne     .LBB2_2
        mov     rdi, rbx
        movaps  xmmword ptr [rsp], xmm0    ;    <-- LOOK
        call    alloc::raw_vec::RawVec<T,A>::reserve_for_push
        movaps  xmm0, xmmword ptr [rsp]    ;    <-- LOOK
        mov     rsi, qword ptr [rbx + 16]
.LBB2_2:
        mov     rax, qword ptr [rbx]
        mov     rcx, rsi
        shl     rcx, 4
        movups  xmmword ptr [rax + rcx], xmm0
        inc     rsi
        mov     qword ptr [rbx + 16], rsi
        add     rsp, 16
        pop     rbx
        ret

We should make the stack spill unnecessary here -- ideally by loading the value after reserve_for_push returns, though other approaches, like tweaking the calling convention, could help too.

@rustbot added the needs-triage label Aug 26, 2023
@the8472 (Member) commented Aug 26, 2023

Would using coldcc for reserve_for_push help?
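(In IR terms, the suggestion amounts to something like the sketch below -- coldcc makes the callee preserve essentially all registers, including the vector registers, so the caller could keep xmm0 live across the call. A hypothetical sketch; the unmangled names and the i64 length parameter are stand-ins:)

; hypothetical sketch: reserve_for_push declared with LLVM's cold
; calling convention, so xmm0 would survive the call without a spill
declare coldcc void @reserve_for_push(ptr, i64)

define void @push(ptr %vec, i64 %len) {
  call coldcc void @reserve_for_push(ptr %vec, i64 %len)
  ret void
}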

@erikdesjardins (Contributor) commented Aug 26, 2023

Looking at what happens in each version (LLVM IR greatly simplified/pseudo-ified for exposition)...


1.71: https://godbolt.org/z/9rY5K3v8c

We generate:

define void @push(ptr %vec, ptr %array) {
  %alloca = alloca [16 x i8]
  call void @memcpy(%alloca, %array)
  call void @Vec::push(%vec, %alloca)
  ret void
}

After inlining:

define void @push(ptr %vec, ptr %array) {
  %alloca = alloca [16 x i8]
  call void @memcpy(%alloca, %array)
call_reserve:
  call void @Vec::reserve_for_push(%vec)
exit:
  call void @memcpy(%vec.buf, %alloca)
  ret void
}

Then instcombine optimizes out the intermediate alloca, and we end up with a memcpy directly from the argument to the vec:

define void @push(ptr %vec, ptr %array) {
  call void @Vec::reserve_for_push(%vec)
exit:
  call void @memcpy(%vec.buf, %array)
  ret void
}

1.72: https://godbolt.org/z/T17znjnxG

We generate:

define void @push(ptr %vec, ptr %array) {
  %alloca = alloca [16 x i8]
  %tmp = load %array
  store %tmp, %alloca
  call void @Vec::push(%vec, %alloca)
  ret void
}

After inlining:

define void @push(ptr %vec, ptr %array) {
  %alloca = alloca [16 x i8]
  %tmp = load %array
  store %tmp, %alloca
  %tmp2 = load %alloca
call_reserve:
  call void @Vec::reserve_for_push(%vec)
exit:
  store %tmp2, %vec.buf
  ret void
}

Then SROA optimizes out the intermediate alloca, and we're left with one of the loads in the entry block:

define void @push(ptr %vec, ptr %array) {
  %tmp2 = load %array
call_reserve:
  call void @Vec::reserve_for_push(%vec)
exit:
  store %tmp2, %vec.buf
  ret void
}

nightly: https://godbolt.org/z/Tb38hGPjE

Unlike 1.71 and 1.72, before inlining there is no intermediate alloca. We directly pass the array ptr to Vec::push (due to a new mir opt on nightly? dest prop?):

define void @push(ptr %vec, ptr %array) {
  call void @Vec::push(%vec, %array)
  ret void
}

I believe this would result in ideal codegen*... except that argpromotion converts the array to be passed by value:

define void @push(ptr %vec, ptr %array) {
  %tmp = load %array
  call void @Vec::push(%vec, %tmp)
  ret void
}

Then, when Vec::push is inlined, we end up with the same thing as 1.72:

define void @push(ptr %vec, ptr %array) {
  %tmp = load %array
call_reserve:
  call void @Vec::reserve_for_push(%vec)
exit:
  store %tmp, %vec.buf
  ret void
}

* because, before argpromotion, Vec::push looks like this:

define internal fastcc void @"_ZN5alloc3vec16Vec$LT$T$C$A$GT$4push17h69513e07f1ac3cddE"(ptr noalias noundef align 8 dereferenceable(24) %self, ptr noalias nocapture noundef align 1 dereferenceable(16) %value) unnamed_addr #0 personality ptr @rust_eh_personality {
start:
  %0 = getelementptr inbounds %"alloc::vec::Vec<[u8; 16]>", ptr %self, i64 0, i32 1
  %_4 = load i64, ptr %0, align 8
  %1 = getelementptr inbounds { ptr, i64 }, ptr %self, i64 0, i32 1
  %2 = load i64, ptr %1, align 8
  %_3 = icmp eq i64 %_4, %2
  br i1 %_3, label %bb1, label %bb3

bb3:                                              ; preds = %bb1, %start
  %self1 = load ptr, ptr %self, align 8
  %count = load i64, ptr %0, align 8
  %end = getelementptr inbounds [16 x i8], ptr %self1, i64 %count
  %3 = load <16 x i8>, ptr %value, align 1
  store <16 x i8> %3, ptr %end, align 1
  %4 = load i64, ptr %0, align 8
  %5 = add i64 %4, 1
  store i64 %5, ptr %0, align 8
  ret void

bb1:                                              ; preds = %start
  call fastcc void @"_ZN5alloc7raw_vec19RawVec$LT$T$C$A$GT$16reserve_for_push17h2c9a8b8fbc06deffE"(ptr noalias noundef nonnull align 8 dereferenceable(16) %self, i64 noundef %_4)
  br label %bb3
}

Note the adjacent:

%3 = load <16 x i8>, ptr %value, align 1
store <16 x i8> %3, ptr %end, align 1

So, to resolve this, I believe LLVM either:

  • needs to be able to sink that load (sketched after this list)
  • needs to run argpromotion after inlining

The latter makes sense to me even just in terms of compile time (argpromotion should do much less work after inlining), but maybe there's a good reason the pipeline is ordered that way.
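Concretely, the first option would let the load sink down to its only use, so nothing is live across the call. In the same simplified pseudo-IR as above (a sketch of the desired result, not actual compiler output):

define void @push(ptr %vec, ptr %array) {
call_reserve:
  call void @Vec::reserve_for_push(%vec)
exit:
  %tmp = load %array    ; load sunk below the call: no spill/reload needed
  store %tmp, %vec.buf
  ret void
}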

@erikdesjardins (Contributor)

Actually, argpromotion may not be the culprit. It only looks that way because our godbolt examples export push, which artificially prevents argpromotion from changing its signature.

In real programs, this wouldn't be the case, and argpromotion would be able to change the signature all the way up to the original value being pushed, which will either be (sketched after this list):

  • a scalar value that was spilled just so it could be passed indirectly, in which case argpromotion is a strict improvement
  • a load from some other memory, in which case the original load already occurs before reserve_for_push
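In the same simplified pseudo-IR, the two cases would look roughly like this (a hypothetical illustration; @caller1, @caller2, %scalar, and %src are made-up names):

; case 1: a scalar spilled just so it can be passed indirectly --
; argpromotion deletes the spill outright
define void @caller1(ptr %vec) {
  %alloca = alloca [16 x i8]
  store %scalar, %alloca
  call void @Vec::push(%vec, %alloca)   ; becomes @Vec::push(%vec, %scalar)
  ret void
}

; case 2: the value lives in some other memory -- after argpromotion
; the load happens in the caller, still before reserve_for_push runs
define void @caller2(ptr %vec, ptr %src) {
  call void @Vec::push(%vec, %src)      ; becomes %tmp = load %src;
                                        ;         @Vec::push(%vec, %tmp)
  ret void
}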

@scottmcm (Member, Author)

@the8472 Yeah, that's what I was thinking about when I said tweaking calling conventions 🙂

I tried it out today, and it does make the push look nicer -- no more sub rsp showing up -- but at the cost of lots of extra work when it does need to reallocate:

_ZN5alloc7raw_vec19RawVec$LT$T$C$A$GT$16reserve_for_push17h78533106871c2e4bE:
.seh_proc _ZN5alloc7raw_vec19RawVec$LT$T$C$A$GT$16reserve_for_push17h78533106871c2e4bE
	push	r11
	.seh_pushreg r11
	push	r10
	.seh_pushreg r10
	push	r9
	.seh_pushreg r9
	push	r8
	.seh_pushreg r8
	push	rdi
	.seh_pushreg rdi
	push	rsi
	.seh_pushreg rsi
	push	rdx
	.seh_pushreg rdx
	push	rcx
	.seh_pushreg rcx
	sub	rsp, 184
	.seh_stackalloc 184
	movaps	xmmword ptr [rsp + 160], xmm5
	.seh_savexmm xmm5, 160
	movaps	xmmword ptr [rsp + 144], xmm4
	.seh_savexmm xmm4, 144
	movaps	xmmword ptr [rsp + 128], xmm3
	.seh_savexmm xmm3, 128
	movaps	xmmword ptr [rsp + 112], xmm2
	.seh_savexmm xmm2, 112
	movaps	xmmword ptr [rsp + 96], xmm1
	.seh_savexmm xmm1, 96
	movaps	xmmword ptr [rsp + 80], xmm0
	.seh_savexmm xmm0, 80
	.seh_endprologue

So I'm not confident it's the right choice, but I tossed it up in #115256 to see what perf says regardless.

As @reinerp mentioned in #97544 (comment), this probably wants something that doesn't have to save-and-restore the simd registers, but preserve_mostcc is still described as "experimental" in langref (and says it only works on x64, but also describes the behaviour on AArch64, so who really knows).
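For reference, the shape being considered is just swapping the convention on the declaration and the call site. A sketch, assuming preserve_most semantics as described in langref (all general-purpose registers preserved except r11 on x86-64, while SIMD registers stay caller-saved):

; hypothetical sketch: reserve_for_push under preserve_mostcc.
; Unlike the coldcc experiment above, the callee would not save the
; xmm registers, so the big save/restore prologue disappears -- but
; the caller must still spill any live vector value such as xmm0.
declare preserve_mostcc void @reserve_for_push(ptr, i64)

define void @push(ptr %vec, i64 %len) {
  call preserve_mostcc void @reserve_for_push(ptr %vec, i64 %len)
  ret void
}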

@scottmcm (Member, Author)

Oh, for extra "fun", this is opt-level-dependent today 😬

I was wondering why it wasn't reproing for me locally, and it was because I was trying with -O, not -C opt-level=3:
[screenshot: the generated assembly differs between -O and -C opt-level=3]
https://rust.godbolt.org/z/35szPYcc6

@erikdesjardins (Contributor)

Looks like the opt-level dependence is because there's no argpromotion at opt-level=2 (https://rust.godbolt.org/z/f44xTqnG4), so my hypothesis was right. (but this is no solace for real code, as per #115242 (comment))

> but preserve_mostcc is still described as "experimental" in langref

I wouldn't worry too much about this. Unlike coldcc, preserve_mostcc is actually exposed/used in Clang: https://godbolt.org/z/cv7E3ej8M. With that, and the known bugs in coldcc, I'd say preserve_mostcc is the less experimental of the two.

@scottmcm (Member, Author) commented Aug 26, 2023

Ah, thanks for the pointer, @erikdesjardins .

It originally looked horrible locally, but it seems that's just because Windows is sadness -- I opened https://discourse.llvm.org/t/conv-c-and-conv-preservemost-mix-badly-on-windows-x64/73054?u=scottmcm to ask about it -- and if I look at calls from the SysV ABI instead it works great.

I'll try to turn that into a PR of some sort. Might need new target data so Windows can know not to use it.

EDIT: Ah, looking at this I think I did the convention-add PR wrong last time (#97512). PR up where I found the right place to put the "don't do this on Windows" conversion: #115260

bors added a commit to rust-lang-ci/rust that referenced this issue Aug 29, 2023
…Lapkin

Use `preserve_mostcc` for `extern "rust-cold"`

As experimentation in rust-lang#115242 has shown, it looks better than `coldcc`. Notably, clang exposes `preserve_most` (https://clang.llvm.org/docs/AttributeReference.html#preserve-most) but not `cold`, so this change should put us on a better-supported path.

And *don't* use a different convention for cold on Windows, because that actually ends up making things worse. (See comment in the code.)

cc tracking issue rust-lang#97544
@Noratrieb added the I-slow, T-compiler, and T-libs labels and removed the needs-triage label Sep 4, 2023
@the8472 (Member) commented Sep 28, 2023

Since the preserve_mostcc PR has landed, I tried applying #115256 locally, but it doesn't appear to work; the register still gets spilled:

vec_push:
        .cfi_startproc
        subq    $24, %rsp
        .cfi_def_cfa_offset 32
        movups  (%rsi), %xmm0
        movq    16(%rdi), %rsi
        cmpq    8(%rdi), %rsi
        jne     .LBB2_2
        movaps  %xmm0, (%rsp)                   # <-- still spilled
        callq   _ZN5alloc7raw_vec19RawVec$LT$T$C$A$GT$16reserve_for_push17hf9b0ac0abb2ba3edE
        movaps  (%rsp), %xmm0                   # <-- still reloaded
        movq    16(%rdi), %rsi
.LBB2_2:
        movq    (%rdi), %rax
        movq    %rsi, %rcx
        shlq    $4, %rcx
        movups  %xmm0, (%rax,%rcx)
        incq    %rsi
        movq    %rsi, 16(%rdi)
        addq    $24, %rsp
        .cfi_def_cfa_offset 8
        retq
define internal preserve_mostcc void @"_ZN5alloc7raw_vec19RawVec$LT$T$C$A$GT$16reserve_for_push17hf9b0ac0abb2ba3edE"
[...]

; Function Attrs: nonlazybind uwtable
define void @vec_push(ptr noalias nocapture noundef align 8 dereferenceable(24) %vec, ptr noalias nocapture noundef readonly align 1 dereferenceable(16) %data) unnamed_addr #2 personality ptr @rust_eh_personality {
start:
  %data.val = load <16 x i8>, ptr %data, align 1 ; <-- still loaded in the entry block, live across the call to reserve_for_push
  tail call void @llvm.experimental.noalias.scope.decl(metadata !18)
  %0 = getelementptr inbounds %"alloc::vec::Vec<[u8; 16]>", ptr %vec, i64 0, i32 1
  %_4.i = load i64, ptr %0, align 8, !alias.scope !18, !noundef !4
  %1 = getelementptr inbounds { ptr, i64 }, ptr %vec, i64 0, i32 1
  %2 = load i64, ptr %1, align 8, !alias.scope !18, !noundef !4
  %_3.i = icmp eq i64 %_4.i, %2
  br i1 %_3.i, label %bb1.i, label %"_ZN5alloc3vec16Vec$LT$T$C$A$GT$4push17ha4b516e01385232bE.exit"

bb1.i:                                            ; preds = %start
; call alloc::raw_vec::RawVec<T,A>::reserve_for_push
  tail call preserve_mostcc void @"_ZN5alloc7raw_vec19RawVec$LT$T$C$A$GT$16reserve_for_push17hf9b0ac0abb2ba3edE"(ptr noalias noundef nonnull align 8 dereferenceable(16) %vec, i64 noundef %_4.i)
  %count.pre.i = load i64, ptr %0, align 8, !alias.scope !18
  br label %"_ZN5alloc3vec16Vec$LT$T$C$A$GT$4push17ha4b516e01385232bE.exit"

"_ZN5alloc3vec16Vec$LT$T$C$A$GT$4push17ha4b516e01385232bE.exit": ; preds = %start, %bb1.i
  %count.i = phi i64 [ %count.pre.i, %bb1.i ], [ %_4.i, %start ]
  %self1.i = load ptr, ptr %vec, align 8, !alias.scope !18, !nonnull !4, !noundef !4
  %end.i = getelementptr inbounds [16 x i8], ptr %self1.i, i64 %count.i
  store <16 x i8> %data.val, ptr %end.i, align 1, !noalias !18
  %3 = add i64 %count.i, 1
  store i64 %3, ptr %0, align 8, !alias.scope !18
  ret void
}
