Noop-loop is no longer optimized away on Nightly #121239

the8472 · 2024-02-17T16:34:24Z

While working on #120682 I noticed that some optimizations regressed after a rebase. The following should be a reduction.

I tried this code:

godbolt

#![feature(ptr_sub_ptr)]
#![feature(unchecked_math)]
#![feature(non_null_convenience)]

use std::ptr::NonNull;
use std::ptr;
use std::mem;

struct It<T> {
    start: NonNull<T>,
    end: *const T,
}

struct Out {
    a: u64,
    b: u64,
    c: u64,
    d: u64,
}

pub fn do_it(ptr: *mut [u64; 4], len: usize) -> It<Out> {
    unsafe {
        let start = NonNull::new_unchecked(ptr);
        let mut it = It::<[u64; 4]> {
            start,
            end: start.add(len).as_ptr()
        };
        end_offset_copy(&mut it)
    }
}

fn end_offset_copy(a: &mut It<[u64; 4]>) -> It<Out> {
    unsafe {
        let len = a.end.sub_ptr(a.start.as_ptr());

        let mut i = len;

        let dest_buf: *mut Out = a.start.as_ptr().cast();

        while i > 0 {
            let val = a.end.sub(i).read();
            let dst = dest_buf.add(len).sub(i);

            ptr::write(dst, mem::transmute(val));

            i = i.unchecked_sub(1); 
        }

        let start = NonNull::new_unchecked(dest_buf);

        It::<Out> {
            start,
            end: start.add(len).as_ptr()
        }
    }
}

I expected to see this happen (beta):

example::do_it:
        mov     rax, rdi
        shl     rsi, 5
        lea     rdx, [rsi + rdi]
        ret

Instead, this happened (nightly):

example::do_it:
        mov     rax, rdi
        movabs  rdx, 576460752303423487
        and     rdx, rsi
        test    rsi, rsi
        je      .LBB0_6
        mov     rcx, rsi
        shl     rcx, 5
        mov     rdi, rdx
        test    sil, 1
        je      .LBB0_3
        lea     rsi, [rax + rcx]
        mov     rdi, rdx
        shl     rdi, 5
        sub     rsi, rdi
        movups  xmm0, xmmword ptr [rsi]
        movups  xmm1, xmmword ptr [rsi + 16]
        movups  xmmword ptr [rax + 16], xmm1
        movups  xmmword ptr [rax], xmm0
        lea     rdi, [rdx - 1]
.LBB0_3:
        cmp     rdx, 1
        je      .LBB0_6
        mov     r8, rdi
        shl     r8, 5
        mov     rsi, rax
        sub     rsi, r8
        add     rsi, 32
        mov     r8, rdx
        shl     r8, 5
.LBB0_5:
        movups  xmm0, xmmword ptr [rsi + rcx - 32]
        movups  xmm1, xmmword ptr [rsi + rcx - 16]
        movups  xmmword ptr [rsi + r8 - 16], xmm1
        movups  xmmword ptr [rsi + r8 - 32], xmm0
        movups  xmm0, xmmword ptr [rsi + rcx]
        movups  xmm1, xmmword ptr [rsi + rcx + 16]
        movups  xmmword ptr [rsi + r8 + 16], xmm1
        movups  xmmword ptr [rsi + r8], xmm0
        add     rsi, 64
        add     rdi, -2
        jne     .LBB0_5
.LBB0_6:
        shl     rdx, 5
        add     rdx, rax
        ret

Meta

rustc --version --verbose:

rustc 1.78.0-nightly (bccb9bbb4 2024-02-16)
binary: rustc
commit-hash: bccb9bbb418a30aeb332052e721beb6ebc6b1ce7
commit-date: 2024-02-16
host: x86_64-unknown-linux-gnu
release: 1.78.0-nightly
LLVM version: 18.1.0
Compiler returned: 0


matthiaskrgr · 2024-02-17T18:01:49Z

I was able to "fix" the nightly codegen to be identical to beta by adding #[inline(always)] to backwards_copy.
Probably some inlining change in LLVM 18? 🤔

matthiaskrgr · 2024-02-17T18:02:28Z

cc @nikic

apiraino · 2024-02-19T11:07:42Z

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-medium

the8472 · 2024-02-19T20:54:08Z

Odd, changing

        let len = a.end.sub_ptr(a.start.as_ptr());

to

        let len = a.end.offset_from(a.start.as_ptr()) as usize;

fixes it. But sub_ptr is supposed to have strictly more UB, i.e. let LLVM optimize more.
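To make the comparison concrete, here is a minimal sketch of the two ways of computing the element count, written against stable APIs only. The helper names `len_via_offset_from` and `len_via_addresses` are hypothetical and not part of the reduction; the second function only models what an unsigned pointer difference computes, under the assumption that `end` is not below `start`, and is not how `sub_ptr` is implemented.

```rust
use std::mem::size_of;

/// Signed variant, as in the workaround: `offset_from` returns an `isize`,
/// which is then cast to `usize`.
///
/// Safety: both pointers must point into (or one past the end of) the same
/// allocation, and their distance must be a multiple of `size_of::<T>()`.
unsafe fn len_via_offset_from<T>(start: *const T, end: *const T) -> usize {
    unsafe { end.offset_from(start) as usize }
}

/// Model of the unsigned variant: subtract the addresses and divide by the
/// element size. Assumes `end >= start`; otherwise the subtraction would
/// overflow (a panic in debug builds).
fn len_via_addresses<T>(start: *const T, end: *const T) -> usize {
    (end as usize - start as usize) / size_of::<T>()
}
```

For a five-element `[[u64; 4]; 5]`, with `start` at the first element and `end` one past the last, both functions return 5. The extra precondition on the unsigned form is the "strictly more UB" mentioned above.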

nikic · 2024-02-20T10:02:38Z

I think the direct cause here is a correctness fix in LLVM 18 which ends up stripping more poison flags during certain loop transformations, and ends up (likely unnecessarily) dropping exact here.

The reason why offset_from helps is that it allows making use of the exact flag earlier by transforming a gep %ty, p, div_exact(off, sizeof(%ty)) into gep i8, p, off. This transform requires that the exact operation be sdiv or ashr -- presumably because GEP offsets are signed.

But this is not actually required (https://alive2.llvm.org/ce/z/gbbYmE). It works just as well with unsigned exact division. I'll see about fixing that.
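The equivalence behind that transform can be sketched at the Rust level. This is only an illustration of the arithmetic, not of LLVM's implementation; the helper names `by_element` and `by_byte` are hypothetical. When `off` is an exact multiple of the element size, stepping by `off / size_of::<T>()` elements reaches the same address as stepping by `off` bytes, and the division here is unsigned.

```rust
use std::mem::size_of;

/// Element-indexed form, mirroring `gep %ty, p, div_exact(off, sizeof(%ty))`.
///
/// Safety: `off` must be a multiple of `size_of::<T>()` and the result must
/// stay within (or one past the end of) the allocation `p` points into.
unsafe fn by_element<T>(p: *const T, off: usize) -> *const T {
    unsafe { p.add(off / size_of::<T>()) }
}

/// Byte-indexed form, mirroring `gep i8, p, off`.
///
/// Safety: same requirements as `by_element`.
unsafe fn by_byte<T>(p: *const T, off: usize) -> *const T {
    unsafe { p.cast::<u8>().add(off).cast::<T>() }
}
```

Both helpers return the same pointer for every byte offset that is a multiple of 32 inside a `[[u64; 4]; N]` buffer.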

nikic · 2024-02-20T10:14:14Z

Upstream patch: llvm/llvm-project#82334

nikic · 2024-02-20T10:21:16Z

> I think the direct cause here is a correctness fix in LLVM 18 which ends up stripping more poison flags during certain loop transformations, and ends up (likely unnecessarily) dropping exact here.

Now this is just rude. Apparently LoopIdiomRecognize strips the flags -- without actually performing any transform. I think LIR does some speculative expansion and then drops instructions again, but this doesn't recover flag changes.

nikic · 2024-02-20T10:42:48Z

Upstream issue for LIR: llvm/llvm-project#82337

Update to LLVM 18.1.0 rc 3 Fixes rust-lang#120819. Fixes rust-lang#121180. Fixes rust-lang#121239. Fixes rust-lang#121367.

the8472 · 2024-03-06T14:43:35Z

Looks like my reduction was insufficient. The example on godbolt does optimize now but the original code that I was working on still fails with the sub nuw but optimizes properly with sub.

I'll extract the IR from a codegen test.

the8472 · 2024-03-06T14:55:59Z

unoptimized IR, using sub_ptr: sub_ptr.ll.txt

optimized IR, using sub_ptr, fails to eliminate loop

; ModuleID = 'vec_in_place2.ce5efceb0ce65a83-cgu.0'
source_filename = "vec_in_place2.ce5efceb0ce65a83-cgu.0"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

%"alloc::vec::Vec<Foo>" = type { %"alloc::raw_vec::RawVec<Foo>", i64 }
%"alloc::raw_vec::RawVec<Foo>" = type { i64, ptr, %"alloc::alloc::Global" }
%"alloc::alloc::Global" = type {}
%Foo = type { i64, i64, i64, i64 }

; Function Attrs: nofree norecurse nosync nounwind nonlazybind uwtable
define void @vec_iterator_cast_aggregate(ptr noalias nocapture noundef writeonly sret(%"alloc::vec::Vec<Foo>") align 8 dereferenceable(24) %_0, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %vec) unnamed_addr #0 personality ptr @rust_eh_personality {
start:
  %me.sroa.0.0.copyload.i = load i64, ptr %vec, align 8, !alias.scope !3, !noalias !6
  %me.sroa.4.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %vec, i64 8
  %me.sroa.4.0.copyload.i = load ptr, ptr %me.sroa.4.0.self.sroa_idx.i, align 8, !alias.scope !3, !noalias !6, !nonnull !8, !noundef !8
  %me.sroa.5.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %vec, i64 16
  %me.sroa.5.0.copyload.i = load i64, ptr %me.sroa.5.0.self.sroa_idx.i, align 8, !alias.scope !3, !noalias !6
  %_19.i = getelementptr inbounds [4 x i64], ptr %me.sroa.4.0.copyload.i, i64 %me.sroa.5.0.copyload.i
  tail call void @llvm.experimental.noalias.scope.decl(metadata !9)
  %_19.i.idx = and i64 %me.sroa.5.0.copyload.i, 576460752303423487
  %0 = icmp eq i64 %me.sroa.5.0.copyload.i, 0
  br i1 %0, label %"_ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h3c7c786b418efdf9E.exit", label %bb4.i.i.preheader

bb4.i.i.preheader:                                ; preds = %start
  %xtraiter = and i64 %me.sroa.5.0.copyload.i, 1
  %1 = icmp eq i64 %_19.i.idx, 1
  br i1 %1, label %"_ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h3c7c786b418efdf9E.exit.loopexit.unr-lcssa", label %bb4.i.i.preheader.new

bb4.i.i.preheader.new:                            ; preds = %bb4.i.i.preheader
  %unroll_iter = sub nsw i64 %_19.i.idx, %xtraiter
  br label %bb4.i.i

bb4.i.i:                                          ; preds = %bb4.i.i, %bb4.i.i.preheader.new
  %state.sroa.5.0.i.i = phi i64 [ 0, %bb4.i.i.preheader.new ], [ %_10.i.i.1, %bb4.i.i ]
  %niter = phi i64 [ 0, %bb4.i.i.preheader.new ], [ %niter.next.1, %bb4.i.i ]
  %_10.i.i = or disjoint i64 %state.sroa.5.0.i.i, 1
  %_14.neg.i.i = sub i64 %state.sroa.5.0.i.i, %_19.i.idx
  %_18.i.i.i.i.i = getelementptr inbounds [4 x i64], ptr %_19.i, i64 %_14.neg.i.i
  %2 = load <4 x i64>, ptr %_18.i.i.i.i.i, align 8, !noalias !12
  %dst.i.i = getelementptr inbounds %Foo, ptr %me.sroa.4.0.copyload.i, i64 %state.sroa.5.0.i.i
  store <4 x i64> %2, ptr %dst.i.i, align 8, !noalias !25
  %_10.i.i.1 = add nuw i64 %state.sroa.5.0.i.i, 2
  %_14.neg.i.i.1 = sub i64 %_10.i.i, %_19.i.idx
  %_18.i.i.i.i.i.1 = getelementptr inbounds [4 x i64], ptr %_19.i, i64 %_14.neg.i.i.1
  %3 = load <4 x i64>, ptr %_18.i.i.i.i.i.1, align 8, !noalias !12
  %dst.i.i.1 = getelementptr inbounds %Foo, ptr %me.sroa.4.0.copyload.i, i64 %_10.i.i
  store <4 x i64> %3, ptr %dst.i.i.1, align 8, !noalias !25
  %niter.next.1 = add i64 %niter, 2
  %niter.ncmp.1 = icmp eq i64 %niter.next.1, %unroll_iter
  br i1 %niter.ncmp.1, label %"_ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h3c7c786b418efdf9E.exit.loopexit.unr-lcssa", label %bb4.i.i

"_ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h3c7c786b418efdf9E.exit.loopexit.unr-lcssa": ; preds = %bb4.i.i, %bb4.i.i.preheader
  %state.sroa.5.0.i.i.unr = phi i64 [ 0, %bb4.i.i.preheader ], [ %_10.i.i.1, %bb4.i.i ]
  %lcmp.mod.not = icmp eq i64 %xtraiter, 0
  br i1 %lcmp.mod.not, label %"_ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h3c7c786b418efdf9E.exit", label %bb4.i.i.epil

bb4.i.i.epil:                                     ; preds = %"_ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h3c7c786b418efdf9E.exit.loopexit.unr-lcssa"
  %_14.neg.i.i.epil = sub i64 %state.sroa.5.0.i.i.unr, %_19.i.idx
  %_18.i.i.i.i.i.epil = getelementptr inbounds [4 x i64], ptr %_19.i, i64 %_14.neg.i.i.epil
  %4 = load <4 x i64>, ptr %_18.i.i.i.i.i.epil, align 8, !noalias !12
  %dst.i.i.epil = getelementptr inbounds %Foo, ptr %me.sroa.4.0.copyload.i, i64 %state.sroa.5.0.i.i.unr
  store <4 x i64> %4, ptr %dst.i.i.epil, align 8, !noalias !25
  br label %"_ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h3c7c786b418efdf9E.exit"

"_ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h3c7c786b418efdf9E.exit": ; preds = %bb4.i.i.epil, %"_ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h3c7c786b418efdf9E.exit.loopexit.unr-lcssa", %start
  %_18.i = and i64 %me.sroa.0.0.copyload.i, 576460752303423487
  store i64 %_18.i, ptr %_0, align 8, !alias.scope !9, !noalias !26
  %vec.sroa.4.0._0.sroa_idx.i = getelementptr inbounds i8, ptr %_0, i64 8
  store ptr %me.sroa.4.0.copyload.i, ptr %vec.sroa.4.0._0.sroa_idx.i, align 8, !alias.scope !9, !noalias !26
  %vec.sroa.5.0._0.sroa_idx.i = getelementptr inbounds i8, ptr %_0, i64 16
  store i64 %_19.i.idx, ptr %vec.sroa.5.0._0.sroa_idx.i, align 8, !alias.scope !9, !noalias !26
  ret void
}

; Function Attrs: nonlazybind uwtable
declare noundef i32 @rust_eh_personality(i32 noundef, i32 noundef, i64 noundef, ptr noundef, ptr noundef) unnamed_addr #1

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(inaccessiblemem: readwrite)
declare void @llvm.experimental.noalias.scope.decl(metadata) #2

attributes #0 = { nofree norecurse nosync nounwind nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #1 = { nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #2 = { nocallback nofree nosync nounwind willreturn memory(inaccessiblemem: readwrite) }

!llvm.module.flags = !{!0, !1}
!llvm.ident = !{!2}

!0 = !{i32 8, !"PIC Level", i32 2}
!1 = !{i32 2, !"RtLibUseGOT", i32 1}
!2 = !{!"rustc version 1.78.0-dev"}
!3 = !{!4}
!4 = distinct !{!4, !5, !"_ZN90_$LT$alloc..vec..Vec$LT$T$C$A$GT$$u20$as$u20$core..iter..traits..collect..IntoIterator$GT$9into_iter17he352ce3d58cff200E: %self"}
!5 = distinct !{!5, !"_ZN90_$LT$alloc..vec..Vec$LT$T$C$A$GT$$u20$as$u20$core..iter..traits..collect..IntoIterator$GT$9into_iter17he352ce3d58cff200E"}
!6 = !{!7}
!7 = distinct !{!7, !5, !"_ZN90_$LT$alloc..vec..Vec$LT$T$C$A$GT$$u20$as$u20$core..iter..traits..collect..IntoIterator$GT$9into_iter17he352ce3d58cff200E: %_0"}
!8 = !{}
!9 = !{!10}
!10 = distinct !{!10, !11, !"_ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h3c7c786b418efdf9E: %_0"}
!11 = distinct !{!11, !"_ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h3c7c786b418efdf9E"}
!12 = !{!13, !15, !16, !18, !19, !21, !22, !10, !24}
!13 = distinct !{!13, !14, !"_ZN103_$LT$alloc..vec..into_iter..IntoIter$LT$T$C$A$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$24index_from_end_unchecked17h43b8872b965e6d69E: %_0"}
!14 = distinct !{!14, !"_ZN103_$LT$alloc..vec..into_iter..IntoIter$LT$T$C$A$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$24index_from_end_unchecked17h43b8872b965e6d69E"}
!15 = distinct !{!15, !14, !"_ZN103_$LT$alloc..vec..into_iter..IntoIter$LT$T$C$A$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$24index_from_end_unchecked17h43b8872b965e6d69E: %self"}
!16 = distinct !{!16, !17, !"_ZN79_$LT$I$u20$as$u20$core..iter..traits..unchecked_iterator..SpecIndexedAccess$GT$30index_from_end_unchecked_inner17h727d6efa915e51e5E: %_0"}
!17 = distinct !{!17, !"_ZN79_$LT$I$u20$as$u20$core..iter..traits..unchecked_iterator..SpecIndexedAccess$GT$30index_from_end_unchecked_inner17h727d6efa915e51e5E"}
!18 = distinct !{!18, !17, !"_ZN79_$LT$I$u20$as$u20$core..iter..traits..unchecked_iterator..SpecIndexedAccess$GT$30index_from_end_unchecked_inner17h727d6efa915e51e5E: %self"}
!19 = distinct !{!19, !20, !"_ZN102_$LT$core..iter..adapters..map..Map$LT$I$C$F$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$24index_from_end_unchecked17hfc88e8970d8e3af0E: %_0"}
!20 = distinct !{!20, !"_ZN102_$LT$core..iter..adapters..map..Map$LT$I$C$F$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$24index_from_end_unchecked17hfc88e8970d8e3af0E"}
!21 = distinct !{!21, !20, !"_ZN102_$LT$core..iter..adapters..map..Map$LT$I$C$F$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$24index_from_end_unchecked17hfc88e8970d8e3af0E: %self"}
!22 = distinct !{!22, !23, !"_ZN83_$LT$I$u20$as$u20$alloc..vec..in_place_collect..SpecInPlaceCollect$LT$T$C$I$GT$$GT$16collect_in_place17h56b8fd95ce366a3bE: argument 0"}
!23 = distinct !{!23, !"_ZN83_$LT$I$u20$as$u20$alloc..vec..in_place_collect..SpecInPlaceCollect$LT$T$C$I$GT$$GT$16collect_in_place17h56b8fd95ce366a3bE"}
!24 = distinct !{!24, !11, !"_ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h3c7c786b418efdf9E: %iterator"}
!25 = !{!22, !10, !24}
!26 = !{!24}

optimized IR, using offset_from, eliminates the loop

; ModuleID = 'vec_in_place2.ce5efceb0ce65a83-cgu.0'
source_filename = "vec_in_place2.ce5efceb0ce65a83-cgu.0"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

%"alloc::vec::Vec<Foo>" = type { %"alloc::raw_vec::RawVec<Foo>", i64 }
%"alloc::raw_vec::RawVec<Foo>" = type { i64, ptr, %"alloc::alloc::Global" }
%"alloc::alloc::Global" = type {}

; Function Attrs: nofree norecurse nosync nounwind nonlazybind memory(argmem: readwrite) uwtable
define void @vec_iterator_cast_aggregate(ptr noalias nocapture noundef writeonly sret(%"alloc::vec::Vec<Foo>") align 8 dereferenceable(24) %_0, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %vec) unnamed_addr #0 personality ptr @rust_eh_personality {
start:
  %me.sroa.0.0.copyload.i = load i64, ptr %vec, align 8, !alias.scope !3, !noalias !6
  %me.sroa.4.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %vec, i64 8
  %me.sroa.4.0.copyload.i = load ptr, ptr %me.sroa.4.0.self.sroa_idx.i, align 8, !alias.scope !3, !noalias !6, !nonnull !8, !noundef !8
  %me.sroa.5.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %vec, i64 16
  %me.sroa.5.0.copyload.i = load i64, ptr %me.sroa.5.0.self.sroa_idx.i, align 8, !alias.scope !3, !noalias !6
  %_18.i = and i64 %me.sroa.0.0.copyload.i, 576460752303423487
  store i64 %_18.i, ptr %_0, align 8, !alias.scope !9, !noalias !12
  %vec.sroa.4.0._0.sroa_idx.i = getelementptr inbounds i8, ptr %_0, i64 8
  store ptr %me.sroa.4.0.copyload.i, ptr %vec.sroa.4.0._0.sroa_idx.i, align 8, !alias.scope !9, !noalias !12
  %vec.sroa.5.0._0.sroa_idx.i = getelementptr inbounds i8, ptr %_0, i64 16
  store i64 %me.sroa.5.0.copyload.i, ptr %vec.sroa.5.0._0.sroa_idx.i, align 8, !alias.scope !9, !noalias !12
  ret void
}

; Function Attrs: nonlazybind uwtable
declare noundef i32 @rust_eh_personality(i32 noundef, i32 noundef, i64 noundef, ptr noundef, ptr noundef) unnamed_addr #1

attributes #0 = { nofree norecurse nosync nounwind nonlazybind memory(argmem: readwrite) uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }
attributes #1 = { nonlazybind uwtable "probe-stack"="inline-asm" "target-cpu"="x86-64" }

!llvm.module.flags = !{!0, !1}
!llvm.ident = !{!2}

!0 = !{i32 8, !"PIC Level", i32 2}
!1 = !{i32 2, !"RtLibUseGOT", i32 1}
!2 = !{!"rustc version 1.78.0-dev"}
!3 = !{!4}
!4 = distinct !{!4, !5, !"_ZN90_$LT$alloc..vec..Vec$LT$T$C$A$GT$$u20$as$u20$core..iter..traits..collect..IntoIterator$GT$9into_iter17he352ce3d58cff200E: %self"}
!5 = distinct !{!5, !"_ZN90_$LT$alloc..vec..Vec$LT$T$C$A$GT$$u20$as$u20$core..iter..traits..collect..IntoIterator$GT$9into_iter17he352ce3d58cff200E"}
!6 = !{!7}
!7 = distinct !{!7, !5, !"_ZN90_$LT$alloc..vec..Vec$LT$T$C$A$GT$$u20$as$u20$core..iter..traits..collect..IntoIterator$GT$9into_iter17he352ce3d58cff200E: %_0"}
!8 = !{}
!9 = !{!10}
!10 = distinct !{!10, !11, !"_ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h3c7c786b418efdf9E: %_0"}
!11 = distinct !{!11, !"_ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h3c7c786b418efdf9E"}
!12 = !{!13}
!13 = distinct !{!13, !11, !"_ZN5alloc3vec16in_place_collect108_$LT$impl$u20$alloc..vec..spec_from_iter..SpecFromIter$LT$T$C$I$GT$$u20$for$u20$alloc..vec..Vec$LT$T$GT$$GT$9from_iter17h3c7c786b418efdf9E: %iterator"}

nikic · 2024-03-06T14:57:41Z

@the8472 Can you please provide a Rust test case or unoptimized IR?

the8472 · 2024-03-06T14:58:55Z

The text file above should contain unoptimized IR.

nikic · 2024-03-06T15:00:10Z

Oh sorry, totally missed the first line...

nikic · 2024-03-06T15:03:05Z

This optimizes with LLVM 19, presumably due to the fix for llvm/llvm-project#82337. But this is too risky to backport.

LLVM does not know that the multiplication never overflows, which causes it to generate unnecessary instructions. Use `usize::unchecked_mul`, so that it can fold the `dst_cap` calculation when `size_of::<I::SRC>() == size_of::<T>()`.

Running:

```
rustc -C llvm-args=-x86-asm-syntax=intel -O src/lib.rs --emit asm
```

```rust
pub struct Foo([usize; 3]);

pub fn unwrap_copy(v: Vec<Foo>) -> Vec<[usize; 3]> {
    v.into_iter().map(|f| f.0).collect()
}
```

Before this commit:

```
define void @unwrap_copy(ptr noalias nocapture noundef writeonly sret([24 x i8]) align 8 dereferenceable(24) %_0, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %iter) {
start:
  %me.sroa.0.0.copyload.i = load i64, ptr %iter, align 8
  %me.sroa.4.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 8
  %me.sroa.4.0.copyload.i = load ptr, ptr %me.sroa.4.0.self.sroa_idx.i, align 8
  %me.sroa.5.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 16
  %me.sroa.5.0.copyload.i = load i64, ptr %me.sroa.5.0.self.sroa_idx.i, align 8
  %_19.i.idx = mul nsw i64 %me.sroa.5.0.copyload.i, 24
  %0 = udiv i64 %_19.i.idx, 24
  %_16.i.i = mul i64 %me.sroa.0.0.copyload.i, 24
  %dst_cap.i.i = udiv i64 %_16.i.i, 24
  store i64 %dst_cap.i.i, ptr %_0, align 8
  %1 = getelementptr inbounds i8, ptr %_0, i64 8
  store ptr %me.sroa.4.0.copyload.i, ptr %1, align 8
  %2 = getelementptr inbounds i8, ptr %_0, i64 16
  store i64 %0, ptr %2, align 8
  ret void
}
```

After:

```
define void @unwrap_copy(ptr noalias nocapture noundef writeonly sret([24 x i8]) align 8 dereferenceable(24) %_0, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %iter) {
start:
  %me.sroa.0.0.copyload.i = load i64, ptr %iter, align 8
  %me.sroa.4.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 8
  %me.sroa.4.0.copyload.i = load ptr, ptr %me.sroa.4.0.self.sroa_idx.i, align 8
  %me.sroa.5.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 16
  %me.sroa.5.0.copyload.i = load i64, ptr %me.sroa.5.0.self.sroa_idx.i, align 8
  %_19.i.idx = mul nsw i64 %me.sroa.5.0.copyload.i, 24
  %0 = udiv i64 %_19.i.idx, 24
  store i64 %me.sroa.0.0.copyload.i, ptr %_0, align 8
  %1 = getelementptr inbounds i8, ptr %_0, i64 8
  store ptr %me.sroa.4.0.copyload.i, ptr %1, align 8
  %2 = getelementptr inbounds i8, ptr %_0, i64 16
  store i64 %0, ptr %2, align 8, !alias.scope !9, !noalias !14
  ret void
}
```

Note that there is still one more `mul,udiv` pair that I couldn't get rid of. The root cause is the same issue as rust-lang#121239, the `nuw` gets stripped off of `ptr::sub_ptr`.
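Why the `mul`/`udiv` pair cannot be folded without the no-overflow guarantee can be seen with plain integer arithmetic. The sketch below is an illustration only; the helper names `roundtrip_wrapping` and `roundtrip_checked` are hypothetical and do not appear in the standard library. It uses `wrapping_mul` to model a multiplication that is allowed to wrap, which is what LLVM must assume when no such guarantee is attached.

```rust
/// Converts a capacity to a byte size and back, with a multiplication that
/// is allowed to wrap. The result equals `cap` only if no wrap occurred.
fn roundtrip_wrapping(cap: usize, elem_size: usize) -> usize {
    cap.wrapping_mul(elem_size) / elem_size
}

/// Same round trip, but reports overflow instead of wrapping.
/// Whenever this returns `Some`, the value is exactly `cap`.
fn roundtrip_checked(cap: usize, elem_size: usize) -> Option<usize> {
    cap.checked_mul(elem_size).map(|bytes| bytes / elem_size)
}
```

With a 24-byte element, a small capacity such as 10 survives the round trip unchanged, while `usize::MAX / 2` does not, because the product wraps. Promising the compiler that the product never wraps is what lets `(cap * 24) / 24` collapse to `cap`.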

optimize inplace collection of Vec

This PR has the following changes:

1. Using `usize::unchecked_mul` in https://github.com/rust-lang/rust/blob/79424056b05eaa9563d16dfab9b9a0c8f033f220/library/alloc/src/vec/in_place_collect.rs#L262, as LLVM does not know that the operation can't wrap, since that's the size of the original allocation.

Given the following:

```rust
pub struct Foo([usize; 3]);

pub fn unwrap_copy(v: Vec<Foo>) -> Vec<[usize; 3]> {
    v.into_iter().map(|f| f.0).collect()
}
```

<details>
<summary>Before this commit:</summary>

```llvm
define void @unwrap_copy(ptr noalias nocapture noundef writeonly sret([24 x i8]) align 8 dereferenceable(24) %_0, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %iter) {
start:
  %me.sroa.0.0.copyload.i = load i64, ptr %iter, align 8
  %me.sroa.4.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 8
  %me.sroa.4.0.copyload.i = load ptr, ptr %me.sroa.4.0.self.sroa_idx.i, align 8
  %me.sroa.5.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 16
  %me.sroa.5.0.copyload.i = load i64, ptr %me.sroa.5.0.self.sroa_idx.i, align 8
  %_19.i.idx = mul nsw i64 %me.sroa.5.0.copyload.i, 24
  %0 = udiv i64 %_19.i.idx, 24
  ; Unnecessary calculation
  %_16.i.i = mul i64 %me.sroa.0.0.copyload.i, 24
  %dst_cap.i.i = udiv i64 %_16.i.i, 24
  store i64 %dst_cap.i.i, ptr %_0, align 8
  %1 = getelementptr inbounds i8, ptr %_0, i64 8
  store ptr %me.sroa.4.0.copyload.i, ptr %1, align 8
  %2 = getelementptr inbounds i8, ptr %_0, i64 16
  store i64 %0, ptr %2, align 8
  ret void
}
```

</details>

<details>
<summary>After:</summary>

```llvm
define void @unwrap_copy(ptr noalias nocapture noundef writeonly sret([24 x i8]) align 8 dereferenceable(24) %_0, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %iter) {
start:
  %me.sroa.0.0.copyload.i = load i64, ptr %iter, align 8
  %me.sroa.4.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 8
  %me.sroa.4.0.copyload.i = load ptr, ptr %me.sroa.4.0.self.sroa_idx.i, align 8
  %me.sroa.5.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 16
  %me.sroa.5.0.copyload.i = load i64, ptr %me.sroa.5.0.self.sroa_idx.i, align 8
  %_19.i.idx = mul nsw i64 %me.sroa.5.0.copyload.i, 24
  %0 = udiv i64 %_19.i.idx, 24
  store i64 %me.sroa.0.0.copyload.i, ptr %_0, align 8
  %1 = getelementptr inbounds i8, ptr %_0, i64 8
  store ptr %me.sroa.4.0.copyload.i, ptr %1, align 8
  %2 = getelementptr inbounds i8, ptr %_0, i64 16
  store i64 %0, ptr %2, align 8, !alias.scope !9, !noalias !14
  ret void
}
```

</details>

Note that there is still one more `mul,udiv` pair that I couldn't get rid of. The root cause is the same issue as rust-lang#121239, the `nuw` gets stripped off of `ptr::sub_ptr`.

2. `Iterator::try_fold` gets called on the underlying Iterator in `SpecInPlaceCollect::collect_in_place` whenever it does not implement `TrustedRandomAccess`. For types that impl `Drop`, LLVM currently can't tell that the drop can never occur, when using the default `Iterator::try_fold` implementation.

For example, given the following code from rust-lang#120493

```rust
#[repr(transparent)]
struct WrappedClone {
    inner: String
}

#[no_mangle]
pub fn unwrap_clone(list: Vec<WrappedClone>) -> Vec<String> {
    list.into_iter().map(|s| s.inner).collect()
}
```

<details>
<summary>The asm for the `unwrap_clone` method is currently:</summary>

```asm
unwrap_clone:
        push    rbp
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        push    rax
        mov     rbx, rdi
        mov     r12, qword ptr [rsi]
        mov     rdi, qword ptr [rsi + 8]
        mov     rax, qword ptr [rsi + 16]
        movabs  rsi, -6148914691236517205
        mov     r14, r12
        test    rax, rax
        je      .LBB0_10
        lea     rcx, [rax + 2*rax]
        lea     r14, [r12 + 8*rcx]
        shl     rax, 3
        lea     rax, [rax + 2*rax]
        xor     ecx, ecx
.LBB0_2:
        cmp     qword ptr [r12 + rcx], 0
        je      .LBB0_4
        add     rcx, 24
        cmp     rax, rcx
        jne     .LBB0_2
        jmp     .LBB0_10
.LBB0_4:
        lea     rdx, [rax - 24]
        lea     r14, [r12 + rcx]
        cmp     rdx, rcx
        je      .LBB0_10
        mov     qword ptr [rsp], rdi
        sub     rax, rcx
        add     rax, -24
        mul     rsi
        mov     r15, rdx
        lea     rbp, [r12 + rcx]
        add     rbp, 32
        shr     r15, 4
        mov     r13, qword ptr [rip + __rust_dealloc@GOTPCREL]
        jmp     .LBB0_6
.LBB0_8:
        add     rbp, 24
        dec     r15
        je      .LBB0_9
.LBB0_6:
        mov     rsi, qword ptr [rbp]
        test    rsi, rsi
        je      .LBB0_8
        mov     rdi, qword ptr [rbp - 8]
        mov     edx, 1
        call    r13
        jmp     .LBB0_8
.LBB0_9:
        mov     rdi, qword ptr [rsp]
        movabs  rsi, -6148914691236517205
.LBB0_10:
        sub     r14, r12
        mov     rax, r14
        mul     rsi
        shr     rdx, 4
        mov     qword ptr [rbx], r12
        mov     qword ptr [rbx + 8], rdi
        mov     qword ptr [rbx + 16], rdx
        mov     rax, rbx
        add     rsp, 8
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        pop     rbp
        ret
```

</details>

<details>
<summary>After this PR:</summary>

```asm
unwrap_clone:
        mov     rax, rdi
        movups  xmm0, xmmword ptr [rsi]
        mov     rcx, qword ptr [rsi + 16]
        movups  xmmword ptr [rdi], xmm0
        mov     qword ptr [rdi + 16], rcx
        ret
```

</details>

Fixes rust-lang#120493
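For readers who want to observe the in-place collection that this PR tunes, the sketch below wraps the `unwrap_copy` example from the description in a hypothetical helper, `collect_and_check`, which reports whether the output `Vec` points at the same buffer as the input. Allocation reuse is an implementation detail of the standard library rather than a documented guarantee, so the boolean is informational only and should not be relied on.

```rust
pub struct Foo([usize; 3]);

/// The example from the PR description: unwrap each `Foo` into its array.
pub fn unwrap_copy(v: Vec<Foo>) -> Vec<[usize; 3]> {
    v.into_iter().map(|f| f.0).collect()
}

/// Hypothetical helper: returns the collected `Vec` together with a flag
/// telling whether its buffer starts at the same address as the input's.
pub fn collect_and_check(v: Vec<Foo>) -> (Vec<[usize; 3]>, bool) {
    let src_addr = v.as_ptr() as usize;
    let out = unwrap_copy(v);
    let reused = out.as_ptr() as usize == src_addr;
    (out, reused)
}
```

Calling `collect_and_check(vec![Foo([1, 2, 3]), Foo([4, 5, 6])])` returns the two arrays in order. When the flag is `true`, the copy loop moves each element onto itself, which is why the optimizer is expected to remove it entirely.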

LLVM does not know that the multiplication never overflows, which causes it to generate unnecessary instructions. Use `usize::unchecked_mul`, so that it can fold the `dst_cap` calculation when `size_of::<I::SRC>() == size_of::<T>()`. Running: ``` rustc -C llvm-args=-x86-asm-syntax=intel -O src/lib.rs --emit asm` ``` ```rust pub struct Foo([usize; 3]); pub fn unwrap_copy(v: Vec<Foo>) -> Vec<[usize; 3]> { v.into_iter().map(|f| f.0).collect() } ``` Before this commit: ``` define void @unwrap_copy(ptr noalias nocapture noundef writeonly sret([24 x i8]) align 8 dereferenceable(24) %_0, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %iter) { start: %me.sroa.0.0.copyload.i = load i64, ptr %iter, align 8 %me.sroa.4.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 8 %me.sroa.4.0.copyload.i = load ptr, ptr %me.sroa.4.0.self.sroa_idx.i, align 8 %me.sroa.5.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 16 %me.sroa.5.0.copyload.i = load i64, ptr %me.sroa.5.0.self.sroa_idx.i, align 8 %_19.i.idx = mul nsw i64 %me.sroa.5.0.copyload.i, 24 %0 = udiv i64 %_19.i.idx, 24 %_16.i.i = mul i64 %me.sroa.0.0.copyload.i, 24 %dst_cap.i.i = udiv i64 %_16.i.i, 24 store i64 %dst_cap.i.i, ptr %_0, align 8 %1 = getelementptr inbounds i8, ptr %_0, i64 8 store ptr %me.sroa.4.0.copyload.i, ptr %1, align 8 %2 = getelementptr inbounds i8, ptr %_0, i64 16 store i64 %0, ptr %2, align 8 ret void } ``` After: ``` define void @unwrap_copy(ptr noalias nocapture noundef writeonly sret([24 x i8]) align 8 dereferenceable(24) %_0, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %iter) { start: %me.sroa.0.0.copyload.i = load i64, ptr %iter, align 8 %me.sroa.4.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 8 %me.sroa.4.0.copyload.i = load ptr, ptr %me.sroa.4.0.self.sroa_idx.i, align 8 %me.sroa.5.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 16 %me.sroa.5.0.copyload.i = load i64, ptr %me.sroa.5.0.self.sroa_idx.i, align 
8 %_19.i.idx = mul nsw i64 %me.sroa.5.0.copyload.i, 24 %0 = udiv i64 %_19.i.idx, 24 store i64 %me.sroa.0.0.copyload.i, ptr %_0, align 8 %1 = getelementptr inbounds i8, ptr %_0, i64 8 store ptr %me.sroa.4.0.copyload.i, ptr %1, align 8 %2 = getelementptr inbounds i8, ptr %_0, i64 16 store i64 %0, ptr %2, align 8, !alias.scope !9, !noalias !14 ret void } ``` Note that there is still one more `mul,udiv` pair that I couldn't get rid of. The root cause is the same issue as rust-lang#121239, the `nuw` gets stripped off of `ptr::sub_ptr`.

optimize inplace collection of Vec

This PR has the following changes:

1. Using `usize::unchecked_mul` in https://github.com/rust-lang/rust/blob/79424056b05eaa9563d16dfab9b9a0c8f033f220/library/alloc/src/vec/in_place_collect.rs#L262, as LLVM does not know that the operation can't wrap, since that's the size of the original allocation.

Given the following:

```rust
pub struct Foo([usize; 3]);

pub fn unwrap_copy(v: Vec<Foo>) -> Vec<[usize; 3]> {
    v.into_iter().map(|f| f.0).collect()
}
```

<details>
<summary>Before this commit:</summary>

```llvm
define void @unwrap_copy(ptr noalias nocapture noundef writeonly sret([24 x i8]) align 8 dereferenceable(24) %_0, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %iter) {
start:
  %me.sroa.0.0.copyload.i = load i64, ptr %iter, align 8
  %me.sroa.4.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 8
  %me.sroa.4.0.copyload.i = load ptr, ptr %me.sroa.4.0.self.sroa_idx.i, align 8
  %me.sroa.5.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 16
  %me.sroa.5.0.copyload.i = load i64, ptr %me.sroa.5.0.self.sroa_idx.i, align 8
  %_19.i.idx = mul nsw i64 %me.sroa.5.0.copyload.i, 24
  %0 = udiv i64 %_19.i.idx, 24
  ; Unnecessary calculation
  %_16.i.i = mul i64 %me.sroa.0.0.copyload.i, 24
  %dst_cap.i.i = udiv i64 %_16.i.i, 24
  store i64 %dst_cap.i.i, ptr %_0, align 8
  %1 = getelementptr inbounds i8, ptr %_0, i64 8
  store ptr %me.sroa.4.0.copyload.i, ptr %1, align 8
  %2 = getelementptr inbounds i8, ptr %_0, i64 16
  store i64 %0, ptr %2, align 8
  ret void
}
```

</details>

<details>
<summary>After:</summary>

```llvm
define void @unwrap_copy(ptr noalias nocapture noundef writeonly sret([24 x i8]) align 8 dereferenceable(24) %_0, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %iter) {
start:
  %me.sroa.0.0.copyload.i = load i64, ptr %iter, align 8
  %me.sroa.4.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 8
  %me.sroa.4.0.copyload.i = load ptr, ptr %me.sroa.4.0.self.sroa_idx.i, align 8
  %me.sroa.5.0.self.sroa_idx.i = getelementptr inbounds i8, ptr %iter, i64 16
  %me.sroa.5.0.copyload.i = load i64, ptr %me.sroa.5.0.self.sroa_idx.i, align 8
  %_19.i.idx = mul nsw i64 %me.sroa.5.0.copyload.i, 24
  %0 = udiv i64 %_19.i.idx, 24
  store i64 %me.sroa.0.0.copyload.i, ptr %_0, align 8
  %1 = getelementptr inbounds i8, ptr %_0, i64 8
  store ptr %me.sroa.4.0.copyload.i, ptr %1, align 8
  %2 = getelementptr inbounds i8, ptr %_0, i64 16
  store i64 %0, ptr %2, align 8, !alias.scope !9, !noalias !14
  ret void
}
```

</details>

Note that there is still one more `mul,udiv` pair that I couldn't get rid of. The root cause is the same issue as rust-lang#121239: the `nuw` gets stripped off of `ptr::sub_ptr`.

2. `Iterator::try_fold` gets called on the underlying Iterator in `SpecInPlaceCollect::collect_in_place` whenever it does not implement `TrustedRandomAccess`. For types that impl `Drop`, LLVM currently can't tell that the drop can never occur when using the default `Iterator::try_fold` implementation.

For example, given the following code from rust-lang#120493:

```rust
#[repr(transparent)]
struct WrappedClone {
    inner: String
}

#[no_mangle]
pub fn unwrap_clone(list: Vec<WrappedClone>) -> Vec<String> {
    list.into_iter().map(|s| s.inner).collect()
}
```

<details>
<summary>The asm for the `unwrap_clone` method is currently:</summary>

```asm
unwrap_clone:
        push    rbp
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        push    rax
        mov     rbx, rdi
        mov     r12, qword ptr [rsi]
        mov     rdi, qword ptr [rsi + 8]
        mov     rax, qword ptr [rsi + 16]
        movabs  rsi, -6148914691236517205
        mov     r14, r12
        test    rax, rax
        je      .LBB0_10
        lea     rcx, [rax + 2*rax]
        lea     r14, [r12 + 8*rcx]
        shl     rax, 3
        lea     rax, [rax + 2*rax]
        xor     ecx, ecx
.LBB0_2:
        cmp     qword ptr [r12 + rcx], 0
        je      .LBB0_4
        add     rcx, 24
        cmp     rax, rcx
        jne     .LBB0_2
        jmp     .LBB0_10
.LBB0_4:
        lea     rdx, [rax - 24]
        lea     r14, [r12 + rcx]
        cmp     rdx, rcx
        je      .LBB0_10
        mov     qword ptr [rsp], rdi
        sub     rax, rcx
        add     rax, -24
        mul     rsi
        mov     r15, rdx
        lea     rbp, [r12 + rcx]
        add     rbp, 32
        shr     r15, 4
        mov     r13, qword ptr [rip + __rust_dealloc@GOTPCREL]
        jmp     .LBB0_6
.LBB0_8:
        add     rbp, 24
        dec     r15
        je      .LBB0_9
.LBB0_6:
        mov     rsi, qword ptr [rbp]
        test    rsi, rsi
        je      .LBB0_8
        mov     rdi, qword ptr [rbp - 8]
        mov     edx, 1
        call    r13
        jmp     .LBB0_8
.LBB0_9:
        mov     rdi, qword ptr [rsp]
        movabs  rsi, -6148914691236517205
.LBB0_10:
        sub     r14, r12
        mov     rax, r14
        mul     rsi
        shr     rdx, 4
        mov     qword ptr [rbx], r12
        mov     qword ptr [rbx + 8], rdi
        mov     qword ptr [rbx + 16], rdx
        mov     rax, rbx
        add     rsp, 8
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        pop     rbp
        ret
```

</details>

<details>
<summary>After this PR:</summary>

```asm
unwrap_clone:
        mov     rax, rdi
        movups  xmm0, xmmword ptr [rsi]
        mov     rcx, qword ptr [rsi + 16]
        movups  xmmword ptr [rdi], xmm0
        mov     qword ptr [rdi + 16], rcx
        ret
```

</details>

Fixes rust-lang#120493
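
To illustrate why the `unchecked_mul` change in item 1 matters, here is a minimal sketch, not the actual `in_place_collect.rs` code. The names `ELEM_SIZE`, `roundtrip_wrapping` and `roundtrip_unchecked` are hypothetical, and the sketch assumes a toolchain where `usize::unchecked_mul` is stable (Rust 1.79 or later). With a wrapping multiply the `mul`/`udiv` pair is not an identity, so it has to stay; with the no-overflow promise the optimizer is free to fold it back to the input, though whether it actually does depends on the LLVM version in use.

```rust
/// Element size used for the round trip (24 bytes on 64-bit targets).
const ELEM_SIZE: usize = std::mem::size_of::<[usize; 3]>();

/// Converts a capacity to a byte size and back with a wrapping multiply.
/// The product may wrap, so the result can differ from `cap` and the
/// multiply/divide pair cannot be removed.
pub fn roundtrip_wrapping(cap: usize) -> usize {
    cap.wrapping_mul(ELEM_SIZE) / ELEM_SIZE
}

/// Same round trip, but with a promise that the product cannot overflow.
///
/// # Safety
/// `cap * ELEM_SIZE` must not overflow `usize`, as is the case when `cap`
/// is the capacity of an existing allocation of such elements.
pub unsafe fn roundtrip_unchecked(cap: usize) -> usize {
    // SAFETY: the caller guarantees that the product fits in `usize`.
    let bytes = unsafe { cap.unchecked_mul(ELEM_SIZE) };
    bytes / ELEM_SIZE
}
```

For in-range inputs both functions return `cap`. Only the wrapping version has a defined result for inputs such as `usize::MAX`, and that result is not `usize::MAX`, which is why the compiler cannot simplify it.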

nikic · 2024-08-01T14:00:43Z

Not sure how to confirm whether this is really fixed now or not.

the8472 · 2024-08-01T14:24:10Z

I'll dust off my branch and check it locally.

lolbinarycat · 2024-09-08T21:02:02Z

triage: original repro indeed fixed, @the8472 any news on that local check?

the8472 · 2024-10-05T15:40:29Z

sub_ptr is working again.

the8472 added I-slow Issue: Problems and improvements with respect to performance of generated code. regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. C-bug Category: This is a bug. labels Feb 17, 2024

rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Feb 17, 2024

fmease changed the title ~~Noop-loop is longer optimized away on Nightly~~ Noop-loop is no longer optimized away on Nightly Feb 17, 2024

matthiaskrgr added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Feb 17, 2024

workingjubilee added the A-codegen Area: Code generation label Feb 17, 2024

rustbot added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Feb 19, 2024

nikic self-assigned this Feb 20, 2024

This was referenced Feb 21, 2024

Update to LLVM 18.1.0 rc 3 rust-lang/llvm-project#167

Merged

Update to LLVM 18.1.0 rc 4 #121395

Merged

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 21, 2024

Auto merge of rust-lang#121395 - nikic:update-llvm-21, r=cuviper

729a664

Update to LLVM 18.1.0 rc 3 Fixes rust-lang#120819. Fixes rust-lang#121180. Fixes rust-lang#121239. Fixes rust-lang#121367.

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 22, 2024

Auto merge of rust-lang#121395 - nikic:update-llvm-21, r=<try>

da675ab

Update to LLVM 18.1.0 rc 3 Fixes rust-lang#120819. Fixes rust-lang#121180. Fixes rust-lang#121239. Fixes rust-lang#121367.

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 22, 2024

Auto merge of rust-lang#121395 - nikic:update-llvm-21, r=<try>

6793580

Update to LLVM 18.1.0 rc 3 Fixes rust-lang#120819. Fixes rust-lang#121180. Fixes rust-lang#121239. Fixes rust-lang#121367.

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 22, 2024

Auto merge of rust-lang#121395 - nikic:update-llvm-21, r=<try>

8866db4

Update to LLVM 18.1.0 rc 3 Fixes rust-lang#120819. Fixes rust-lang#121180. Fixes rust-lang#121239. Fixes rust-lang#121367.

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 24, 2024

Auto merge of rust-lang#121395 - nikic:update-llvm-21, r=<try>

7c9bfbe

Update to LLVM 18.1.0 rc 3 Fixes rust-lang#120819. Fixes rust-lang#121180. Fixes rust-lang#121239. Fixes rust-lang#121367.

bors added a commit to rust-lang-ci/rust that referenced this issue Mar 1, 2024

Auto merge of rust-lang#121395 - nikic:update-llvm-21, r=<try>

5e826de

Update to LLVM 18.1.0 rc 3 Fixes rust-lang#120819. Fixes rust-lang#121180. Fixes rust-lang#121239. Fixes rust-lang#121367.

bors closed this as completed in e612d07 Mar 2, 2024

the8472 reopened this Mar 6, 2024

the8472 added the llvm-fixed-upstream Issue expected to be fixed by the next major LLVM upgrade, or backported fixes label Mar 6, 2024

jwong101 mentioned this issue Apr 13, 2024

optimize inplace collection of Vec #123878

Merged

lolbinarycat added the S-needs-repro Status: This issue has no reproduction and needs a reproduction to make progress. label Sep 8, 2024

the8472 closed this as completed Oct 5, 2024
