Nearly 50% performance regression when copying/moving small byte arrays between 1.71 and 1.72 #115212

tpelkone · 2023-08-25T13:48:06Z

Summary

Nearly 50% performance regression between versions 1.71 and 1.72 when copying small byte arrays with stable-aarch64-apple-darwin platform (Macbook Pro M1). Used cargo bisect-rustc to track down the regression to fd9bf59 #111999

Code

I tried this code:

#![feature(test)]
extern crate test;
use test::Bencher;

#[bench]
fn bench(b: &mut Bencher) {
    b.iter(|| {
        let data = vec![[1u8; 24]; 1024 * 1024];
        let mut new_vec = Vec::with_capacity(1024 * 1024);
        for entry in data {
            new_vec.push(entry);
        }
    });
}

I expected to see this happen: to run about 6,200,000 ns/iter like version 1.71

$ cargo +nightly-2023-06-06-aarch64-apple-darwin bench
   Compiling benchmarks v0.1.0 (/Users/tuomas/rust/benchmarks)
    Finished bench [optimized] target(s) in 0.18s
     Running unittests src/main.rs (target/release/deps/benchmarks-0c0dfc3834e16b96)

running 1 test
test bench ... bench:   6,249,724 ns/iter (+/- 1,609,874)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out; finished in 5.70s

Instead, this happened: 9,159,412 ns/iter

$ cargo +nightly-2023-06-07-aarch64-apple-darwin bench
    Finished bench [optimized] target(s) in 0.00s
     Running unittests src/main.rs (target/release/deps/benchmarks-0c0dfc3834e16b96)

running 1 test
test bench ... bench:   9,159,412 ns/iter (+/- 469,568)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out; finished in 2.78s

Version it worked on

It most recently worked on: 1.71
More specifically nightly-2023-06-06-aarch64-apple-darwin

Version with regression

1.72
nightly-2023-06-07-aarch64-apple-darwin

rustc --version --verbose:

rustc 1.72.0 (5680fa18f 2023-08-23)
binary: rustc
commit-hash: 5680fa18feaa87f3ff04063800aec256c3d4b4be
commit-date: 2023-08-23
host: aarch64-apple-darwin
release: 1.72.0
LLVM version: 16.0.5

cargo bisect-rustc

Searched nightlies: from nightly-2023-06-06 to nightly-2023-06-07
regressed nightly: nightly-2023-06-07
searched commit range: e6d4725...b2b34bd
regressed commit: fd9bf59

bisected with cargo-bisect-rustc v0.6.7

Host triple: x86_64-apple-darwin
Reproduce with:

cargo bisect-rustc --start=2023-06-06 --end=2023-06-07 -- run --release

@rustbot modify labels: +regression-from-stable-to-stable -regression-untriaged

The text was updated successfully, but these errors were encountered:

saethlin · 2023-08-25T18:55:57Z

Here's a demo for [u8; 31]: https://godbolt.org/z/c3rdPM8K4

erikdesjardins · 2023-08-25T19:11:49Z

Here's a demo for [u8; 31]: godbolt.org/z/c3rdPM8K4

Oof, that's massively worse. But even with a nice size ([u8; 32]), it's still a small regression: https://godbolt.org/z/1PW1xbcjn

We should probably revert #111999, or limit it to very small arrays (<= 8 bytes / 1 usize? https://godbolt.org/z/6sWdYaeEx)

saethlin · 2023-08-25T19:30:12Z

The original motivation of that PR is for arrays larger than that, so I'm not so sure.

If I strip this down to just the pointer write it's not as bad, but it's quite clear that the problem is just x86-isel: https://godbolt.org/z/qh89c85P8 Is that even good instruction selection?

scottmcm · 2023-08-25T22:02:35Z

Hmm, curiously looking at just the part of the push logic that copies the value, the assembly is entirely reasonable:

pub fn push(v: &mut Vec<[u8; 24]>, x: [u8; 24]) {
    if v.len() < v.capacity() { v.push(x); }
}

https://rust.godbolt.org/z/45fEo314o

example::push:
        mov     rax, qword ptr [rdi + 16]
        cmp     rax, qword ptr [rdi + 8]
        jae     .LBB0_2
        movups  xmm0, xmmword ptr [rsi]
        mov     rcx, qword ptr [rsi + 16]
        mov     rdx, qword ptr [rdi]
        lea     rsi, [rax + 2*rax]
        mov     qword ptr [rdx + 8*rsi + 16], rcx
        movups  xmmword ptr [rdx + 8*rsi], xmm0
        inc     rax
        mov     qword ptr [rdi + 16], rax
.LBB0_2:
        ret

But then in the context of a "real" push, it somehow still blows up: https://rust.godbolt.org/z/zW34r7h1j

scottmcm · 2023-08-26T03:16:11Z

Here's a small tweak to try to fix this for 1.73: #115236

scottmcm · 2023-08-26T08:49:16Z

I've opened #115242 to track the suboptimal push codegen mentioned in #115212 (comment)

@tpelkone Once a new nightly is out, please give it a shot and reopen if you're still seeing the regression.

mati865 · 2023-08-26T09:16:23Z

On Ubuntu VM running on Ryzen 5950X this is far worse:

# 1.71.1
test bench ... bench:   6,363,711 ns/iter (+/- 311,522)
# 1,72,0
test bench ... bench:  15,529,123 ns/iter (+/- 656,905)

2.4x regression.

tpelkone · 2023-08-27T03:16:37Z

@scottmcm Things look much better with rust version 1.74.0-nightly (69e97df5c 2023-08-26). Thanks for the quick fix!

scottmcm · 2023-08-27T03:46:05Z

@tpelkone Thanks for confirming! I'll see if it can make 1.73.

tpelkone added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Aug 25, 2023

tpelkone mentioned this issue Aug 25, 2023

Use load+store instead of memcpy for small integer arrays #111999

Merged

scottmcm self-assigned this Aug 25, 2023

scottmcm mentioned this issue Aug 26, 2023

Stop emitting non-power-of-two vectors in (non-portable-SIMD) codegen #115236

Merged

bors closed this as completed in 766c0c0 Aug 26, 2023

scottmcm mentioned this issue Aug 26, 2023

Vec::push should load a passed-by-pointer argument *after* reserve_for_push #115242

Open

apiraino removed the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Aug 30, 2023

fintelia mentioned this issue Sep 27, 2023

Using std::simd to speed-up unfilter for Paeth for bpp=3 and bpp=6 image-rs/image-png#414

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nearly 50% performance regression when copying/moving small byte arrays between 1.71 and 1.72 #115212

Nearly 50% performance regression when copying/moving small byte arrays between 1.71 and 1.72 #115212

tpelkone commented Aug 25, 2023

saethlin commented Aug 25, 2023

erikdesjardins commented Aug 25, 2023

saethlin commented Aug 25, 2023

scottmcm commented Aug 25, 2023

scottmcm commented Aug 26, 2023

scottmcm commented Aug 26, 2023

mati865 commented Aug 26, 2023

tpelkone commented Aug 27, 2023

scottmcm commented Aug 27, 2023

Nearly 50% performance regression when copying/moving small byte arrays between 1.71 and 1.72 #115212

Nearly 50% performance regression when copying/moving small byte arrays between 1.71 and 1.72 #115212

Comments

tpelkone commented Aug 25, 2023

Summary

Code

Version it worked on

Version with regression

cargo bisect-rustc

saethlin commented Aug 25, 2023

erikdesjardins commented Aug 25, 2023

saethlin commented Aug 25, 2023

scottmcm commented Aug 25, 2023

scottmcm commented Aug 26, 2023

scottmcm commented Aug 26, 2023

mati865 commented Aug 26, 2023

tpelkone commented Aug 27, 2023

scottmcm commented Aug 27, 2023