Investigate pairwise sum/product flops for reductions/horizontal ops #235

Open · workingjubilee opened this issue Jan 31, 2022 · 38 comments

Labels: A-floating-point (Area: Floating point numbers and arithmetic), A-LLVM (Area: LLVM), blocks-stable, C-feature-request (Category: a feature request, i.e. not implemented / a PR)

@workingjubilee
Member

workingjubilee commented Jan 31, 2022

Currently, the horizontal ops use a sequential ordering. This ordering shouldn't matter at all for integers or operations like min/max, because those should be associative, last I checked.

But it matters a lot for floats and their sums/products: a strictly sequential order is not only slow (it introduces a long chain of data dependencies), it also accumulates close to the maximum possible rounding error. IEEE-754 states that the order of the sum operation on a vector may be reassociated, almost certainly for precisely this reason, and most of the hardware we are targeting supports pairwise summation or product operations. These are also called "tree reductions" for reasons that become obvious when you tilt your head and stare at what actually happens. Even on hardware without such instructions, they should be reasonable to emulate in software.
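
To make the error claim concrete, here is a minimal scalar sketch (mine, not part of std::simd) contrasting the strictly sequential order with a pairwise/tree order:

fn sequential_sum(xs: &[f32]) -> f32 {
    // left-to-right fold: one long dependency chain, O(n) worst-case error growth
    xs.iter().fold(0.0, |acc, &x| acc + x)
}

fn pairwise_sum(xs: &[f32]) -> f32 {
    // tree reduction: split in half and sum each half, O(log n) error growth
    match xs {
        [] => 0.0,
        [x] => *x,
        _ => {
            let (lo, hi) = xs.split_at(xs.len() / 2);
            pairwise_sum(lo) + pairwise_sum(hi)
        }
    }
}

fn main() {
    // 2^25 copies of 1.0: the sequential sum gets stuck at 2^24 because
    // 2^24 + 1.0 rounds back down in f32, while the pairwise sum is exact here.
    let xs = vec![1.0f32; 1 << 25];
    println!("sequential: {}", sequential_sum(&xs)); // 16777216
    println!("pairwise:   {}", pairwise_sum(&xs));   // 33554432
}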

I know @gnzlbg was intending to use tree reductions for packed_simd. We could just add the "allow reassociation" flags to the ops (which is what the intrinsics originally did), but that unfortunately leaves the order totally unspecified. Perhaps LLVM always uses tree reductions to implement this, but they don't promise to do so. It does seem to make things go faster, though:
https://llvm.godbolt.org/z/dhbdoeq4d

We would still like to have a consistent ordering, as that would be most portable (the algorithm would produce the same results on all machines). We should explore whether it's possible to consistently make LLVM emit pairwise float sums, and what the performance implications are.

This got discussed extensively in the original Rust Zulip thread.
There's a Julia forum thread on summation algorithms that probably contains relevant content.

We should adopt a resolution to commit to a horizontal_{sum,product} ordering (or lack thereof) before we stabilize, so marking this as blocking stable. Some actionable steps to take first:

  • ask other stakeholders (read: backend maintainers) like @bjorn3 and @antoyo what they think would actually be best.
  • try to emulate tree reductions using swizzles and see if LLVM is earning its paycheck?
  • produce a few examples we can actually benchmark that do piles of summations and products and try out various reduction algorithms (sequential, pairwise, etc.) with them.
@workingjubilee added the A-floating-point, C-feature-request, A-LLVM, and blocks-stable labels on Jan 31, 2022
@programmerjake
Member

proposed tree reduction algorithm (compatible with SimpleV's tree reduction algorithm):
https://rust.godbolt.org/z/vfjb4c5WE

#![feature(portable_simd)]
use std::simd::{Simd, LaneCount, SupportedLaneCount, Swizzle};

const fn make_swizzle<const N: usize>(step: usize) -> [usize; N] {
    let mut retval = [0; N];
    let mut i = 0;
    while i < N {
        retval[i] = i - i % step + step / 2;
        if retval[i] >= N {
            retval[i] = 0;
        }
        i += 1;
    }
    retval
}

struct S<const N: usize, const STEP: usize>;

impl<const N: usize, const STEP: usize> Swizzle<N, N> for S<N, STEP> {
    const INDEX: [usize; N] = make_swizzle::<N>(STEP);
}

pub fn tree_reduce<const N: usize>(mut vec: Simd<f32, N>) -> f32
where
    LaneCount<N>: SupportedLaneCount,
{
    macro_rules! step {
        ($step:literal) => {
            if $step < N * 2 {
                vec += S::<N, $step>::swizzle(vec);
            }
        };
    }
    step!(2);
    step!(4);
    step!(8);
    step!(16);
    step!(32);
    step!(64);
    vec[0]
}
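
A quick usage sketch of the snippet above (my addition, assuming the same nightly portable_simd API as the godbolt link); with small integer values every intermediate sum is exact, so the pairwise order agrees with a sequential sum:

fn main() {
    let v = Simd::from_array([1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
    assert_eq!(tree_reduce(v), 36.0); // ((1+2)+(3+4)) + ((5+6)+(7+8))
}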

@workingjubilee
Member Author

Wow,

pub fn horizontal_sum(x: Simd<f32, $N>) -> f32 {
    x.horizontal_sum()
}

completely scalarizes (using vaddss) for all N that I tried.

example::horizontal_sum_8:
        vxorps  xmm0, xmm0, xmm0
        vaddss  xmm0, xmm0, dword ptr [rdi]
        vaddss  xmm0, xmm0, dword ptr [rdi + 4]
        vaddss  xmm0, xmm0, dword ptr [rdi + 8]
        vaddss  xmm0, xmm0, dword ptr [rdi + 12]
        vaddss  xmm0, xmm0, dword ptr [rdi + 16]
        vaddss  xmm0, xmm0, dword ptr [rdi + 20]
        vaddss  xmm0, xmm0, dword ptr [rdi + 24]
        vaddss  xmm0, xmm0, dword ptr [rdi + 28]
        ret

@pthariensflame
Contributor

pthariensflame commented Feb 1, 2022

proposed tree reduction algorithm (compatible with SimpleV's tree reduction algorithm): https://rust.godbolt.org/z/vfjb4c5WE

This doesn’t seem to work as well on AArch64, where it only makes use of FADDP for the 2 case and in higher cases does exponentially increasing “useless work”.

EDIT: This behavior gets worse when the ISA supports SVE(2), where it should be happy to use FADDV but instead doesn’t use SVE instructions at all.

@programmerjake
Member

it looks better if we reverse the order of the shuffles:
https://rust.godbolt.org/z/o693E4Mc7

Maybe I can convince Libre-SOC to change SimpleV to match if this is a sufficiently good motivation.

#![feature(portable_simd)]
use std::simd::{Simd, LaneCount, SupportedLaneCount, Swizzle};

const fn make_swizzle<const N: usize>(step: usize) -> [usize; N] {
    let mut retval = [0; N];
    let mut i = 0;
    while i < N {
        retval[i] = i + step / 2;
        if retval[i] >= N {
            retval[i] = 0;
        }
        i += 1;
    }
    retval
}

struct S<const N: usize, const STEP: usize>;

impl<const N: usize, const STEP: usize> Swizzle<N, N> for S<N, STEP> {
    const INDEX: [usize; N] = make_swizzle::<N>(STEP);
}

pub fn tree_reduce<const N: usize>(mut vec: Simd<f32, N>) -> f32
where
    LaneCount<N>: SupportedLaneCount,
{
    macro_rules! step {
        ($step:literal) => {
            if $step < N * 2 {
                vec += S::<N, $step>::swizzle(vec);
            }
        };
    }
    step!(64);
    step!(32);
    step!(16);
    step!(8);
    step!(4);
    step!(2);
    vec[0]
}

@programmerjake
Member

nice ascii-art diagrams:

... reversed means it reduces like so (@ are adds;
this one is more efficient on arm):

0    1    2    3    4    5    6    7
|    |    |    |    |    |    |    |
@----|----|----|----/    |    |    |
|    @----|----|---------/    |    | distance=4
|    |    @----|--------------/    |
|    |    |    @-------------------/
|    |    |    |
@----|----/    | distance=2
|    @---------/
|    |
@----/ distance=1
|

the original (current SimpleV) reduce goes distance=1,2,4...

0    1    2    3    4    5    6    7
|    |    |    |    |    |    |    |
@----/    @----/    @----/    @----/ distance=1
|    .    |    .    |    .    |
@---------/    .    @---------/ distance=2
|    .    .    .    |
@-------------------/ distance=4
|

@programmerjake
Member

Turns out Arm's faddv instruction, which does reduction, uses a pattern like the current SimpleV and what I originally proposed in this thread (distance=1,2,4 from the diagram) -- except that it only matches when the vector length is a power of 2.

@programmerjake
Member

llvm defaults to the distance=4,2,1 variant...it doesn't support non-power-of-2 vector lengths:
https://github.com/llvm/llvm-project/blob/ec00c9cdeb5ee9fc7846cb3d2a4d53eba2c35a43/llvm/lib/Transforms/Utils/LoopUtils.cpp#L959

@workingjubilee
Member Author

Hm, that does look a lot better.

example::tree_reduce_4:
        ldr     q0, [x0]
        dup     v1.2d, v0.d[1]
        fadd    v0.4s, v0.4s, v1.4s
        faddp   s0, v0.2s
        ret

example::tree_reduce_8:
        ldp     q1, q0, [x0]
        fadd    v0.4s, v1.4s, v0.4s
        dup     v1.2d, v0.d[1]
        fadd    v0.4s, v0.4s, v1.4s
        faddp   s0, v0.2s
        ret

example::tree_reduce_16:
        ldp     q0, q1, [x0, #32]
        ldp     q3, q2, [x0]
        fadd    v1.4s, v2.4s, v1.4s
        fadd    v0.4s, v3.4s, v0.4s
        fadd    v0.4s, v0.4s, v1.4s
        dup     v1.2d, v0.d[1]
        fadd    v0.4s, v0.4s, v1.4s
        faddp   s0, v0.2s
        ret

Both versions seem to optimize... less than spectacularly when I pass larger values in -Cllvm-args="--aarch64-sve-vector-bits-min={N}". But the second version goes nicely on AVX512 for the really big ones:

example::tree_reduce_64:
        vmovaps zmm0, zmmword ptr [rdi + 64]
        vmovaps zmm1, zmmword ptr [rdi + 128]
        vaddps  zmm2, zmm1, zmmword ptr [rdi]
        vaddps  zmm1, zmm1, dword ptr [rdi]{1to16}
        vaddps  zmm0, zmm0, zmmword ptr [rdi + 192]
        vaddps  zmm1, zmm0, zmm1
        vaddps  zmm0, zmm2, zmm0
        vshuff64x2      zmm1, zmm0, zmm1, 78
        vaddps  zmm0, zmm0, zmm1
        vextractf128    xmm1, ymm0, 1
        vaddps  xmm0, xmm0, xmm1
        vpermilpd       xmm1, xmm0, 1
        vaddps  xmm0, xmm0, xmm1
        vmovshdup       xmm1, xmm0
        vaddss  xmm0, xmm0, xmm1
        vzeroupper
        ret

@pthariensflame
Contributor

pthariensflame commented Feb 2, 2022

They’re “better”, but still pretty bad. I would have expected, say,

example::tree_reduce_8:
        ldp     q1, q0, [x0]
        fadd    v0.4s, v1.4s, v0.4s
        mov     d1, v0.d[1]
        fadd    v0.2s, v0.2s, v1.2s
        faddp   s0, v0.2s
        ret

since it doesn’t need all 128 bits for the second half. And it gets progressively worse the higher you go.

@pthariensflame
Contributor

pthariensflame commented Feb 2, 2022

Also, the two-input vector form of FADDP should be more than capable of handling the original order, like so:

example::tree_reduce_16:
        ldp     q0, q1, [x0, #32]
        ldp     q2, q3, [x0]
        faddp    v0.4s, v1.4s, v0.4s
        faddp    v2.4s, v3.4s, v2.4s
        faddp    v0.4s, v2.4s, v0.4s
        mov     d1, v0.d[1]
        faddp    v0.2s, v1.2s, v0.2s
        faddp    s0, v0.2s
        ret

@workingjubilee
Member Author

workingjubilee commented Feb 2, 2022

The extra dup instead of a mov is weird, but being one relatively low-cost instruction away from the preferable form on the 8-lane version is not actually going to make or break our decisions here. What we're truly evaluating against is not ideal hand-rolled assembly: it's an N-instruction serial fadd (or fadda on Arm... but that's not guaranteed to be faster, and it's definitely not more precise).

The issues we are encountering here seem to be due to a lack of education on LLVM's part, and we can only do so much to tack against that wind.

Jacob seems to have cooked up a variant that uses the "obvious" reduction algorithm we have been expecting would apply (and should allow using faddv, LLVM just doesn't seem to ever emit that...?), and it is not ideal either, but it's a fair bit better while still being Rust-defined: https://rust.godbolt.org/z/fxYE3Yvzb

@programmerjake changed the title from "Investigate pairwise sum/product flops" to "Investigate pairwise sum/product flops for reductions/horizontal ops" on Feb 8, 2022
@antoyo

antoyo commented Feb 19, 2022

I'm not sure how I can help here, but as I implemented some SIMD stuff in rustc_codegen_gcc recently, I was able to try both tree reduction algorithms.
Here are the results:

First
000000000025ded0 <_RNvCs1fmkINL2TB9_12test_backend13tree_reduce_1>:
  25ded0:	48 83 ec 18          	sub    rsp,0x18
  25ded4:	8b 07                	mov    eax,DWORD PTR [rdi]
  25ded6:	48 8d 15 2b 51 4c 00 	lea    rdx,[rip+0x4c512b]        # 723008 <global_1_test_backend_5d019e9d_cgu_0>
  25dedd:	31 f6                	xor    esi,esi
  25dedf:	48 8d 7c 24 08       	lea    rdi,[rsp+0x8]
  25dee4:	89 44 24 08          	mov    DWORD PTR [rsp+0x8],eax
  25dee8:	e8 e3 01 00 00       	call   25e0d0 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj1_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
  25deed:	c5 fa 10 00          	vmovss xmm0,DWORD PTR [rax]
  25def1:	48 83 c4 18          	add    rsp,0x18
  25def5:	c3                   	ret    
  25def6:	66 2e 0f 1f 84 00 00 	cs nop WORD PTR [rax+rax*1+0x0]
  25defd:	00 00 00 

000000000025df00 <_RNvCs1fmkINL2TB9_12test_backend13tree_reduce_2>:
  25df00:	48 83 ec 18          	sub    rsp,0x18
  25df04:	c5 fa 7e 0f          	vmovq  xmm1,QWORD PTR [rdi]
  25df08:	48 8d 15 f9 50 4c 00 	lea    rdx,[rip+0x4c50f9]        # 723008 <global_1_test_backend_5d019e9d_cgu_0>
  25df0f:	31 f6                	xor    esi,esi
  25df11:	48 8d 7c 24 08       	lea    rdi,[rsp+0x8]
  25df16:	c5 fa 16 c1          	vmovshdup xmm0,xmm1
  25df1a:	c5 f8 58 c1          	vaddps xmm0,xmm0,xmm1
  25df1e:	c5 f8 13 44 24 08    	vmovlps QWORD PTR [rsp+0x8],xmm0
  25df24:	e8 c7 01 00 00       	call   25e0f0 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj2_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
  25df29:	c5 fa 10 00          	vmovss xmm0,DWORD PTR [rax]
  25df2d:	48 83 c4 18          	add    rsp,0x18
  25df31:	c3                   	ret    
  25df32:	66 2e 0f 1f 84 00 00 	cs nop WORD PTR [rax+rax*1+0x0]
  25df39:	00 00 00 
  25df3c:	0f 1f 40 00          	nop    DWORD PTR [rax+0x0]

000000000025df40 <_RNvCs1fmkINL2TB9_12test_backend13tree_reduce_4>:
  25df40:	48 83 ec 18          	sub    rsp,0x18
  25df44:	c4 e3 79 04 07 f5    	vpermilps xmm0,XMMWORD PTR [rdi],0xf5
  25df4a:	c5 f8 58 07          	vaddps xmm0,xmm0,XMMWORD PTR [rdi]
  25df4e:	31 f6                	xor    esi,esi
  25df50:	48 8d 15 b1 50 4c 00 	lea    rdx,[rip+0x4c50b1]        # 723008 <global_1_test_backend_5d019e9d_cgu_0>
  25df57:	48 89 e7             	mov    rdi,rsp
  25df5a:	c4 e3 79 04 c8 aa    	vpermilps xmm1,xmm0,0xaa
  25df60:	c5 f0 58 c0          	vaddps xmm0,xmm1,xmm0
  25df64:	c5 f8 29 04 24       	vmovaps XMMWORD PTR [rsp],xmm0
  25df69:	e8 a2 01 00 00       	call   25e110 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj4_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
  25df6e:	c5 fa 10 00          	vmovss xmm0,DWORD PTR [rax]
  25df72:	48 83 c4 18          	add    rsp,0x18
  25df76:	c3                   	ret    
  25df77:	66 0f 1f 84 00 00 00 	nop    WORD PTR [rax+rax*1+0x0]
  25df7e:	00 00 

000000000025df80 <_RNvCs1fmkINL2TB9_12test_backend13tree_reduce_8>:
  25df80:	55                   	push   rbp
  25df81:	31 f6                	xor    esi,esi
  25df83:	48 89 e5             	mov    rbp,rsp
  25df86:	48 83 e4 e0          	and    rsp,0xffffffffffffffe0
  25df8a:	48 83 ec 20          	sub    rsp,0x20
  25df8e:	c4 e3 7d 04 0f f5    	vpermilps ymm1,YMMWORD PTR [rdi],0xf5
  25df94:	c5 f4 58 0f          	vaddps ymm1,ymm1,YMMWORD PTR [rdi]
  25df98:	48 8d 15 69 50 4c 00 	lea    rdx,[rip+0x4c5069]        # 723008 <global_1_test_backend_5d019e9d_cgu_0>
  25df9f:	48 89 e7             	mov    rdi,rsp
  25dfa2:	c4 e3 7d 04 c1 aa    	vpermilps ymm0,ymm1,0xaa
  25dfa8:	c5 fc 58 c1          	vaddps ymm0,ymm0,ymm1
  25dfac:	c4 e3 7d 04 c8 00    	vpermilps ymm1,ymm0,0x0
  25dfb2:	c4 e3 75 06 c9 11    	vperm2f128 ymm1,ymm1,ymm1,0x11
  25dfb8:	c5 f4 58 c0          	vaddps ymm0,ymm1,ymm0
  25dfbc:	c5 fc 29 04 24       	vmovaps YMMWORD PTR [rsp],ymm0
  25dfc1:	c5 f8 77             	vzeroupper 
  25dfc4:	e8 67 01 00 00       	call   25e130 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj8_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
  25dfc9:	c5 fa 10 00          	vmovss xmm0,DWORD PTR [rax]
  25dfcd:	c9                   	leave  
  25dfce:	c3                   	ret    
  25dfcf:	90                   	nop

000000000025dfd0 <_RNvCs1fmkINL2TB9_12test_backend14tree_reduce_16>:
  25dfd0:	55                   	push   rbp
  25dfd1:	31 f6                	xor    esi,esi
  25dfd3:	48 89 e5             	mov    rbp,rsp
  25dfd6:	48 83 e4 c0          	and    rsp,0xffffffffffffffc0
  25dfda:	48 83 ec 40          	sub    rsp,0x40
  25dfde:	c5 fa 10 7f 1c       	vmovss xmm7,DWORD PTR [rdi+0x1c]
  25dfe3:	c5 fa 10 5f 14       	vmovss xmm3,DWORD PTR [rdi+0x14]
  25dfe8:	c5 fa 10 77 0c       	vmovss xmm6,DWORD PTR [rdi+0xc]
  25dfed:	c5 fa 10 4f 04       	vmovss xmm1,DWORD PTR [rdi+0x4]
  25dff2:	c5 fa 10 6f 3c       	vmovss xmm5,DWORD PTR [rdi+0x3c]
  25dff7:	c5 fa 10 57 34       	vmovss xmm2,DWORD PTR [rdi+0x34]
  25dffc:	c5 c0 14 ff          	vunpcklps xmm7,xmm7,xmm7
  25e000:	c5 e0 14 db          	vunpcklps xmm3,xmm3,xmm3
  25e004:	c5 fa 10 67 2c       	vmovss xmm4,DWORD PTR [rdi+0x2c]
  25e009:	c5 c8 14 f6          	vunpcklps xmm6,xmm6,xmm6
  25e00d:	c5 f0 14 c9          	vunpcklps xmm1,xmm1,xmm1
  25e011:	c5 e0 16 df          	vmovlhps xmm3,xmm3,xmm7
  25e015:	c5 fa 10 47 24       	vmovss xmm0,DWORD PTR [rdi+0x24]
  25e01a:	c5 f0 16 ce          	vmovlhps xmm1,xmm1,xmm6
  25e01e:	c5 d0 14 ed          	vunpcklps xmm5,xmm5,xmm5
  25e022:	c5 e8 14 d2          	vunpcklps xmm2,xmm2,xmm2
  25e026:	c5 d8 14 e4          	vunpcklps xmm4,xmm4,xmm4
  25e02a:	c5 e8 16 d5          	vmovlhps xmm2,xmm2,xmm5
  25e02e:	c4 e3 75 18 cb 01    	vinsertf128 ymm1,ymm1,xmm3,0x1
  25e034:	48 8d 15 cd 4f 4c 00 	lea    rdx,[rip+0x4c4fcd]        # 723008 <global_1_test_backend_5d019e9d_cgu_0>
  25e03b:	c5 f8 14 c0          	vunpcklps xmm0,xmm0,xmm0
  25e03f:	c5 f4 58 0f          	vaddps ymm1,ymm1,YMMWORD PTR [rdi]
  25e043:	c5 f8 16 c4          	vmovlhps xmm0,xmm0,xmm4
  25e047:	c4 e3 7d 18 c2 01    	vinsertf128 ymm0,ymm0,xmm2,0x1
  25e04d:	c5 fc 58 47 20       	vaddps ymm0,ymm0,YMMWORD PTR [rdi+0x20]
  25e052:	48 89 e7             	mov    rdi,rsp
  25e055:	c4 e3 7d 04 d9 aa    	vpermilps ymm3,ymm1,0xaa
  25e05b:	c5 e4 58 d9          	vaddps ymm3,ymm3,ymm1
  25e05f:	c4 e3 7d 04 c8 aa    	vpermilps ymm1,ymm0,0xaa
  25e065:	c5 f4 58 c0          	vaddps ymm0,ymm1,ymm0
  25e069:	c4 e3 7d 04 d0 00    	vpermilps ymm2,ymm0,0x0
  25e06f:	c4 e3 6d 06 d2 11    	vperm2f128 ymm2,ymm2,ymm2,0x11
  25e075:	c5 ec 58 d0          	vaddps ymm2,ymm2,ymm0
  25e079:	c4 e3 7d 04 c3 00    	vpermilps ymm0,ymm3,0x0
  25e07f:	c4 e3 7d 06 c0 11    	vperm2f128 ymm0,ymm0,ymm0,0x11
  25e085:	c5 fc 58 c3          	vaddps ymm0,ymm0,ymm3
  25e089:	c4 e2 7d 18 ca       	vbroadcastss ymm1,xmm2
  25e08e:	c5 fc 58 c1          	vaddps ymm0,ymm0,ymm1
  25e092:	c5 f4 58 ca          	vaddps ymm1,ymm1,ymm2
  25e096:	c5 fc 29 04 24       	vmovaps YMMWORD PTR [rsp],ymm0
  25e09b:	c5 fc 29 4c 24 20    	vmovaps YMMWORD PTR [rsp+0x20],ymm1
  25e0a1:	c5 f8 77             	vzeroupper 
  25e0a4:	e8 07 00 00 00       	call   25e0b0 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj10_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
  25e0a9:	c5 fa 10 00          	vmovss xmm0,DWORD PTR [rax]
  25e0ad:	c9                   	leave  
  25e0ae:	c3                   	ret    
  25e0af:	90                   	nop
Second
000000000025ded0 <_RNvCs1fmkINL2TB9_12test_backend13tree_reduce_1>:
  25ded0:	48 83 ec 18          	sub    rsp,0x18
  25ded4:	8b 07                	mov    eax,DWORD PTR [rdi]
  25ded6:	48 8d 15 2b 51 4c 00 	lea    rdx,[rip+0x4c512b]        # 723008 <global_1_test_backend_5d019e9d_cgu_0>
  25dedd:	31 f6                	xor    esi,esi
  25dedf:	48 8d 7c 24 08       	lea    rdi,[rsp+0x8]
  25dee4:	89 44 24 08          	mov    DWORD PTR [rsp+0x8],eax
  25dee8:	e8 13 02 00 00       	call   25e100 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj1_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
  25deed:	c5 fa 10 00          	vmovss xmm0,DWORD PTR [rax]
  25def1:	48 83 c4 18          	add    rsp,0x18
  25def5:	c3                   	ret    
  25def6:	66 2e 0f 1f 84 00 00 	cs nop WORD PTR [rax+rax*1+0x0]
  25defd:	00 00 00 

000000000025df00 <_RNvCs1fmkINL2TB9_12test_backend13tree_reduce_2>:
  25df00:	48 83 ec 18          	sub    rsp,0x18
  25df04:	c5 fa 7e 0f          	vmovq  xmm1,QWORD PTR [rdi]
  25df08:	48 8d 15 f9 50 4c 00 	lea    rdx,[rip+0x4c50f9]        # 723008 <global_1_test_backend_5d019e9d_cgu_0>
  25df0f:	31 f6                	xor    esi,esi
  25df11:	48 8d 7c 24 08       	lea    rdi,[rsp+0x8]
  25df16:	c5 f0 c6 c1 e1       	vshufps xmm0,xmm1,xmm1,0xe1
  25df1b:	c5 f8 58 c1          	vaddps xmm0,xmm0,xmm1
  25df1f:	c5 f8 13 44 24 08    	vmovlps QWORD PTR [rsp+0x8],xmm0
  25df25:	e8 f6 01 00 00       	call   25e120 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj2_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
  25df2a:	c5 fa 10 00          	vmovss xmm0,DWORD PTR [rax]
  25df2e:	48 83 c4 18          	add    rsp,0x18
  25df32:	c3                   	ret    
  25df33:	66 2e 0f 1f 84 00 00 	cs nop WORD PTR [rax+rax*1+0x0]
  25df3a:	00 00 00 
  25df3d:	0f 1f 00             	nop    DWORD PTR [rax]

000000000025df40 <_RNvCs1fmkINL2TB9_12test_backend13tree_reduce_4>:
  25df40:	48 83 ec 18          	sub    rsp,0x18
  25df44:	c4 e3 79 04 07 0e    	vpermilps xmm0,XMMWORD PTR [rdi],0xe
  25df4a:	c5 f8 58 07          	vaddps xmm0,xmm0,XMMWORD PTR [rdi]
  25df4e:	31 f6                	xor    esi,esi
  25df50:	48 8d 15 b1 50 4c 00 	lea    rdx,[rip+0x4c50b1]        # 723008 <global_1_test_backend_5d019e9d_cgu_0>
  25df57:	48 89 e7             	mov    rdi,rsp
  25df5a:	c4 e3 79 04 c8 39    	vpermilps xmm1,xmm0,0x39
  25df60:	c5 f0 58 c0          	vaddps xmm0,xmm1,xmm0
  25df64:	c5 f8 29 04 24       	vmovaps XMMWORD PTR [rsp],xmm0
  25df69:	e8 d2 01 00 00       	call   25e140 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj4_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
  25df6e:	c5 fa 10 00          	vmovss xmm0,DWORD PTR [rax]
  25df72:	48 83 c4 18          	add    rsp,0x18
  25df76:	c3                   	ret    
  25df77:	66 0f 1f 84 00 00 00 	nop    WORD PTR [rax+rax*1+0x0]
  25df7e:	00 00 

000000000025df80 <_RNvCs1fmkINL2TB9_12test_backend13tree_reduce_8>:
  25df80:	55                   	push   rbp
  25df81:	31 f6                	xor    esi,esi
  25df83:	48 89 e5             	mov    rbp,rsp
  25df86:	48 83 e4 e0          	and    rsp,0xffffffffffffffe0
  25df8a:	48 83 ec 20          	sub    rsp,0x20
  25df8e:	c5 fc 28 07          	vmovaps ymm0,YMMWORD PTR [rdi]
  25df92:	c5 fd 6f 0d 66 90 3f 	vmovdqa ymm1,YMMWORD PTR [rip+0x3f9066]        # 657000 <_fini+0xce4>
  25df99:	00 
  25df9a:	48 8d 15 67 50 4c 00 	lea    rdx,[rip+0x4c5067]        # 723008 <global_1_test_backend_5d019e9d_cgu_0>
  25dfa1:	48 89 e7             	mov    rdi,rsp
  25dfa4:	c4 e2 75 36 c8       	vpermd ymm1,ymm1,ymm0
  25dfa9:	c5 fc 58 c1          	vaddps ymm0,ymm0,ymm1
  25dfad:	c5 fd 6f 0d 6b 90 3f 	vmovdqa ymm1,YMMWORD PTR [rip+0x3f906b]        # 657020 <_fini+0xd04>
  25dfb4:	00 
  25dfb5:	c4 e2 75 36 c8       	vpermd ymm1,ymm1,ymm0
  25dfba:	c5 fc 58 c1          	vaddps ymm0,ymm0,ymm1
  25dfbe:	c5 fd 6f 0d 7a 90 3f 	vmovdqa ymm1,YMMWORD PTR [rip+0x3f907a]        # 657040 <_fini+0xd24>
  25dfc5:	00 
  25dfc6:	c4 e2 75 36 c8       	vpermd ymm1,ymm1,ymm0
  25dfcb:	c5 fc 58 c1          	vaddps ymm0,ymm0,ymm1
  25dfcf:	c5 fc 29 04 24       	vmovaps YMMWORD PTR [rsp],ymm0
  25dfd4:	c5 f8 77             	vzeroupper 
  25dfd7:	e8 84 01 00 00       	call   25e160 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj8_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
  25dfdc:	c5 fa 10 00          	vmovss xmm0,DWORD PTR [rax]
  25dfe0:	c9                   	leave  
  25dfe1:	c3                   	ret    
  25dfe2:	66 2e 0f 1f 84 00 00 	cs nop WORD PTR [rax+rax*1+0x0]
  25dfe9:	00 00 00 
  25dfec:	0f 1f 40 00          	nop    DWORD PTR [rax+0x0]

000000000025dff0 <_RNvCs1fmkINL2TB9_12test_backend14tree_reduce_16>:
  25dff0:	55                   	push   rbp
  25dff1:	31 f6                	xor    esi,esi
  25dff3:	48 89 e5             	mov    rbp,rsp
  25dff6:	48 83 e4 c0          	and    rsp,0xffffffffffffffc0
  25dffa:	48 83 ec 40          	sub    rsp,0x40
  25dffe:	c5 fa 10 47 38       	vmovss xmm0,DWORD PTR [rdi+0x38]
  25e003:	c5 fa 10 4f 30       	vmovss xmm1,DWORD PTR [rdi+0x30]
  25e008:	c4 e3 79 21 47 3c 10 	vinsertps xmm0,xmm0,DWORD PTR [rdi+0x3c],0x10
  25e00f:	c4 e3 71 21 4f 34 10 	vinsertps xmm1,xmm1,DWORD PTR [rdi+0x34],0x10
  25e016:	c5 fd 6f 25 82 90 3f 	vmovdqa ymm4,YMMWORD PTR [rip+0x3f9082]        # 6570a0 <_fini+0xd84>
  25e01d:	00 
  25e01e:	48 8d 15 e3 4f 4c 00 	lea    rdx,[rip+0x4c4fe3]        # 723008 <global_1_test_backend_5d019e9d_cgu_0>
  25e025:	c5 f0 16 c8          	vmovlhps xmm1,xmm1,xmm0
  25e029:	c5 fa 10 47 28       	vmovss xmm0,DWORD PTR [rdi+0x28]
  25e02e:	c4 e3 79 21 57 2c 10 	vinsertps xmm2,xmm0,DWORD PTR [rdi+0x2c],0x10
  25e035:	c5 fa 10 47 20       	vmovss xmm0,DWORD PTR [rdi+0x20]
  25e03a:	c4 e3 79 21 47 24 10 	vinsertps xmm0,xmm0,DWORD PTR [rdi+0x24],0x10
  25e041:	c5 f8 16 c2          	vmovlhps xmm0,xmm0,xmm2
  25e045:	c4 e3 7d 18 c1 01    	vinsertf128 ymm0,ymm0,xmm1,0x1
  25e04b:	c4 e2 7d 18 0f       	vbroadcastss ymm1,DWORD PTR [rdi]
  25e050:	c5 fc 58 07          	vaddps ymm0,ymm0,YMMWORD PTR [rdi]
  25e054:	c5 f4 58 4f 20       	vaddps ymm1,ymm1,YMMWORD PTR [rdi+0x20]
  25e059:	48 89 e7             	mov    rdi,rsp
  25e05c:	c4 e3 7d 06 d9 21    	vperm2f128 ymm3,ymm0,ymm1,0x21
  25e062:	c5 e4 58 d8          	vaddps ymm3,ymm3,ymm0
  25e066:	c4 e3 75 06 c0 21    	vperm2f128 ymm0,ymm1,ymm0,0x21
  25e06c:	c4 e2 7d 0c 05 eb 8f 	vpermilps ymm0,ymm0,YMMWORD PTR [rip+0x3f8feb]        # 657060 <_fini+0xd44>
  25e073:	3f 00 
  25e075:	c5 fc 58 c1          	vaddps ymm0,ymm0,ymm1
  25e079:	c5 fd 6f 0d ff 8f 3f 	vmovdqa ymm1,YMMWORD PTR [rip+0x3f8fff]        # 657080 <_fini+0xd64>
  25e080:	00 
  25e081:	c4 e3 65 06 d0 21    	vperm2f128 ymm2,ymm3,ymm0,0x21
  25e087:	c4 e2 75 36 c8       	vpermd ymm1,ymm1,ymm0
  25e08c:	c5 e4 c6 d2 4e       	vshufps ymm2,ymm3,ymm2,0x4e
  25e091:	c5 ec 58 d3          	vaddps ymm2,ymm2,ymm3
  25e095:	c4 e2 5d 36 db       	vpermd ymm3,ymm4,ymm3
  25e09a:	c4 e3 75 0c cb c0    	vblendps ymm1,ymm1,ymm3,0xc0
  25e0a0:	c5 f4 58 c0          	vaddps ymm0,ymm1,ymm0
  25e0a4:	c4 e3 6d 06 c8 21    	vperm2f128 ymm1,ymm2,ymm0,0x21
  25e0aa:	c4 e3 75 0f ca 04    	vpalignr ymm1,ymm1,ymm2,0x4
  25e0b0:	c5 ec 58 c9          	vaddps ymm1,ymm2,ymm1
  25e0b4:	c4 e3 7d 06 d2 21    	vperm2f128 ymm2,ymm0,ymm2,0x21
  25e0ba:	c4 e3 6d 0f d0 04    	vpalignr ymm2,ymm2,ymm0,0x4
  25e0c0:	c5 fc 58 c2          	vaddps ymm0,ymm0,ymm2
  25e0c4:	c5 fc 29 0c 24       	vmovaps YMMWORD PTR [rsp],ymm1
  25e0c9:	c5 fc 29 44 24 20    	vmovaps YMMWORD PTR [rsp+0x20],ymm0
  25e0cf:	c5 f8 77             	vzeroupper 
  25e0d2:	e8 09 00 00 00       	call   25e0e0 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj10_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
  25e0d7:	c5 fa 10 00          	vmovss xmm0,DWORD PTR [rax]
  25e0db:	c9                   	leave  
  25e0dc:	c3                   	ret    
  25e0dd:	0f 1f 00             	nop    DWORD PTR [rax]

To be noted:

  • the current implementation of simd_shuffle is awkward as GCC expects a mask of the same size as the input vectors while LLVM does not.
  • Inlining attributes are not yet implemented in rustc_codegen_gcc, which might explain the calls in this code. Edit: I implemented them, but the call is still there afterwards. As the index function does not have an #[inline] attribute, I'm not sure how it gets inlined by the LLVM backend. Any idea? Maybe it's implicitly inlinable because it's a generic function? If so, maybe my implementation of inlining is wrong somehow.
  • Not sure what the non-SIMD instructions are doing here. Perhaps they are needed for the call instructions, or they come from my simd_shuffle implementation. Edit: see my next comment: it is fixed with 1 CGU.

@bjorn3
Member

bjorn3 commented Feb 20, 2022

In LLVM, any function in the same CGU is inlinable unless you use #[inline(never)]. #[inline] has the dual purpose of ensuring that the function is codegened in the caller's CGU (which already happens here because it's generic) and telling the codegen backend that it should consider inlining advantageous even for bigger functions above the default inline threshold. Most Rust code is fine-tuned for LLVM's behavior.
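
A tiny illustrative sketch of the two attributes described above (this is the generic behavior of the LLVM backend, nothing specific to this issue):

#[inline] // codegen a copy into every caller's CGU and treat inlining as advantageous
pub fn sum4(x: [f32; 4]) -> f32 {
    x.iter().sum()
}

#[inline(never)] // always keep this as a real call, even within the same CGU
pub fn sum4_opaque(x: [f32; 4]) -> f32 {
    x.iter().sum()
}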

@antoyo

antoyo commented Feb 20, 2022

Ah, no, it was because of the CGUs: I assume rustc has ThinLTO enabled by default.
Here's the resulting code with 1 CGU:

First
000000000025dec0 <_RNvCsacpq7FklLKk_12test_backend13tree_reduce_1>:
  25dec0:	c5 fa 10 07          	vmovss xmm0,DWORD PTR [rdi]
  25dec4:	c3                   	ret    
  25dec5:	66 2e 0f 1f 84 00 00 	cs nop WORD PTR [rax+rax*1+0x0]
  25decc:	00 00 00 
  25decf:	90                   	nop

000000000025ded0 <_RNvCsacpq7FklLKk_12test_backend13tree_reduce_2>:
  25ded0:	c5 fa 7e 0f          	vmovq  xmm1,QWORD PTR [rdi]
  25ded4:	c5 fa 16 c1          	vmovshdup xmm0,xmm1
  25ded8:	c5 f8 58 c1          	vaddps xmm0,xmm0,xmm1
  25dedc:	c3                   	ret    
  25dedd:	0f 1f 00             	nop    DWORD PTR [rax]

000000000025dee0 <_RNvCsacpq7FklLKk_12test_backend13tree_reduce_4>:
  25dee0:	c4 e3 79 04 0f f5    	vpermilps xmm1,XMMWORD PTR [rdi],0xf5
  25dee6:	c5 f0 58 0f          	vaddps xmm1,xmm1,XMMWORD PTR [rdi]
  25deea:	c4 e3 79 04 c1 aa    	vpermilps xmm0,xmm1,0xaa
  25def0:	c5 f8 58 c1          	vaddps xmm0,xmm0,xmm1
  25def4:	c3                   	ret    
  25def5:	66 2e 0f 1f 84 00 00 	cs nop WORD PTR [rax+rax*1+0x0]
  25defc:	00 00 00 
  25deff:	90                   	nop

000000000025df00 <_RNvCsacpq7FklLKk_12test_backend13tree_reduce_8>:
  25df00:	c4 e3 7d 04 07 f5    	vpermilps ymm0,YMMWORD PTR [rdi],0xf5
  25df06:	c5 fc 58 07          	vaddps ymm0,ymm0,YMMWORD PTR [rdi]
  25df0a:	c4 e3 7d 04 c8 aa    	vpermilps ymm1,ymm0,0xaa
  25df10:	c5 f4 58 c8          	vaddps ymm1,ymm1,ymm0
  25df14:	c4 e3 7d 04 c1 00    	vpermilps ymm0,ymm1,0x0
  25df1a:	c4 e3 7d 06 c0 11    	vperm2f128 ymm0,ymm0,ymm0,0x11
  25df20:	c5 fc 58 c1          	vaddps ymm0,ymm0,ymm1
  25df24:	c5 f8 77             	vzeroupper 
  25df27:	c3                   	ret    
  25df28:	0f 1f 84 00 00 00 00 	nop    DWORD PTR [rax+rax*1+0x0]
  25df2f:	00 

000000000025df30 <_RNvCsacpq7FklLKk_12test_backend14tree_reduce_16>:
  25df30:	c5 fa 10 7f 1c       	vmovss xmm7,DWORD PTR [rdi+0x1c]
  25df35:	c5 fa 10 5f 14       	vmovss xmm3,DWORD PTR [rdi+0x14]
  25df3a:	c5 fa 10 77 0c       	vmovss xmm6,DWORD PTR [rdi+0xc]
  25df3f:	c5 fa 10 4f 04       	vmovss xmm1,DWORD PTR [rdi+0x4]
  25df44:	c5 fa 10 6f 3c       	vmovss xmm5,DWORD PTR [rdi+0x3c]
  25df49:	c5 fa 10 57 34       	vmovss xmm2,DWORD PTR [rdi+0x34]
  25df4e:	c5 c0 14 ff          	vunpcklps xmm7,xmm7,xmm7
  25df52:	c5 e0 14 db          	vunpcklps xmm3,xmm3,xmm3
  25df56:	c5 fa 10 67 2c       	vmovss xmm4,DWORD PTR [rdi+0x2c]
  25df5b:	c5 c8 14 f6          	vunpcklps xmm6,xmm6,xmm6
  25df5f:	c5 f0 14 c9          	vunpcklps xmm1,xmm1,xmm1
  25df63:	c5 e0 16 df          	vmovlhps xmm3,xmm3,xmm7
  25df67:	c5 fa 10 47 24       	vmovss xmm0,DWORD PTR [rdi+0x24]
  25df6c:	c5 f0 16 ce          	vmovlhps xmm1,xmm1,xmm6
  25df70:	c5 d0 14 ed          	vunpcklps xmm5,xmm5,xmm5
  25df74:	c5 e8 14 d2          	vunpcklps xmm2,xmm2,xmm2
  25df78:	c5 d8 14 e4          	vunpcklps xmm4,xmm4,xmm4
  25df7c:	c5 e8 16 d5          	vmovlhps xmm2,xmm2,xmm5
  25df80:	c4 e3 75 18 cb 01    	vinsertf128 ymm1,ymm1,xmm3,0x1
  25df86:	c5 f8 14 c0          	vunpcklps xmm0,xmm0,xmm0
  25df8a:	c5 f4 58 0f          	vaddps ymm1,ymm1,YMMWORD PTR [rdi]
  25df8e:	c5 f8 16 c4          	vmovlhps xmm0,xmm0,xmm4
  25df92:	c4 e3 7d 18 c2 01    	vinsertf128 ymm0,ymm0,xmm2,0x1
  25df98:	c5 fc 58 47 20       	vaddps ymm0,ymm0,YMMWORD PTR [rdi+0x20]
  25df9d:	c4 e3 7d 04 d1 aa    	vpermilps ymm2,ymm1,0xaa
  25dfa3:	c5 ec 58 d1          	vaddps ymm2,ymm2,ymm1
  25dfa7:	c4 e3 7d 04 c8 aa    	vpermilps ymm1,ymm0,0xaa
  25dfad:	c5 f4 58 c0          	vaddps ymm0,ymm1,ymm0
  25dfb1:	c4 e3 7d 04 c8 00    	vpermilps ymm1,ymm0,0x0
  25dfb7:	c4 e3 75 06 c9 11    	vperm2f128 ymm1,ymm1,ymm1,0x11
  25dfbd:	c5 f4 58 c8          	vaddps ymm1,ymm1,ymm0
  25dfc1:	c4 e3 7d 04 c2 00    	vpermilps ymm0,ymm2,0x0
  25dfc7:	c4 e3 7d 06 c0 11    	vperm2f128 ymm0,ymm0,ymm0,0x11
  25dfcd:	c5 fc 58 c2          	vaddps ymm0,ymm0,ymm2
  25dfd1:	c4 e2 7d 18 c9       	vbroadcastss ymm1,xmm1
  25dfd6:	c5 fc 58 c1          	vaddps ymm0,ymm0,ymm1
  25dfda:	c5 f8 77             	vzeroupper 
  25dfdd:	c3                   	ret    
Second
000000000025dec0 <_RNvCsacpq7FklLKk_12test_backend13tree_reduce_1>:
  25dec0:	c5 fa 10 07          	vmovss xmm0,DWORD PTR [rdi]
  25dec4:	c3                   	ret    
  25dec5:	66 2e 0f 1f 84 00 00 	cs nop WORD PTR [rax+rax*1+0x0]
  25decc:	00 00 00 
  25decf:	90                   	nop

000000000025ded0 <_RNvCsacpq7FklLKk_12test_backend13tree_reduce_2>:
  25ded0:	c5 fa 7e 0f          	vmovq  xmm1,QWORD PTR [rdi]
  25ded4:	c5 f0 c6 c1 e1       	vshufps xmm0,xmm1,xmm1,0xe1
  25ded9:	c5 f8 58 c1          	vaddps xmm0,xmm0,xmm1
  25dedd:	c3                   	ret    
  25dede:	66 90                	xchg   ax,ax

000000000025dee0 <_RNvCsacpq7FklLKk_12test_backend13tree_reduce_4>:
  25dee0:	c4 e3 79 04 0f 0e    	vpermilps xmm1,XMMWORD PTR [rdi],0xe
  25dee6:	c5 f0 58 0f          	vaddps xmm1,xmm1,XMMWORD PTR [rdi]
  25deea:	c4 e3 79 04 c1 39    	vpermilps xmm0,xmm1,0x39
  25def0:	c5 f8 58 c1          	vaddps xmm0,xmm0,xmm1
  25def4:	c3                   	ret    
  25def5:	66 2e 0f 1f 84 00 00 	cs nop WORD PTR [rax+rax*1+0x0]
  25defc:	00 00 00 
  25deff:	90                   	nop

000000000025df00 <_RNvCsacpq7FklLKk_12test_backend13tree_reduce_8>:
  25df00:	c5 fc 28 07          	vmovaps ymm0,YMMWORD PTR [rdi]
  25df04:	c5 fd 6f 0d f4 a0 42 	vmovdqa ymm1,YMMWORD PTR [rip+0x42a0f4]        # 688000 <_fini+0xccc>
  25df0b:	00 
  25df0c:	c4 e2 75 36 c8       	vpermd ymm1,ymm1,ymm0
  25df11:	c5 fc 58 c1          	vaddps ymm0,ymm0,ymm1
  25df15:	c5 fd 6f 0d 03 a1 42 	vmovdqa ymm1,YMMWORD PTR [rip+0x42a103]        # 688020 <_fini+0xcec>
  25df1c:	00 
  25df1d:	c4 e2 75 36 c8       	vpermd ymm1,ymm1,ymm0
  25df22:	c5 fc 58 c1          	vaddps ymm0,ymm0,ymm1
  25df26:	c5 fd 6f 0d 12 a1 42 	vmovdqa ymm1,YMMWORD PTR [rip+0x42a112]        # 688040 <_fini+0xd0c>
  25df2d:	00 
  25df2e:	c4 e2 75 36 c8       	vpermd ymm1,ymm1,ymm0
  25df33:	c5 fc 58 c1          	vaddps ymm0,ymm0,ymm1
  25df37:	c5 f8 77             	vzeroupper 
  25df3a:	c3                   	ret    
  25df3b:	0f 1f 44 00 00       	nop    DWORD PTR [rax+rax*1+0x0]

000000000025df40 <_RNvCsacpq7FklLKk_12test_backend14tree_reduce_16>:
  25df40:	c5 fa 10 47 38       	vmovss xmm0,DWORD PTR [rdi+0x38]
  25df45:	c4 e3 79 21 4f 3c 10 	vinsertps xmm1,xmm0,DWORD PTR [rdi+0x3c],0x10
  25df4c:	c5 fa 10 47 30       	vmovss xmm0,DWORD PTR [rdi+0x30]
  25df51:	c4 e3 79 21 47 34 10 	vinsertps xmm0,xmm0,DWORD PTR [rdi+0x34],0x10
  25df58:	c5 fd 6f 25 40 a1 42 	vmovdqa ymm4,YMMWORD PTR [rip+0x42a140]        # 6880a0 <_fini+0xd6c>
  25df5f:	00 
  25df60:	c5 f8 16 c1          	vmovlhps xmm0,xmm0,xmm1
  25df64:	c5 fa 10 4f 28       	vmovss xmm1,DWORD PTR [rdi+0x28]
  25df69:	c4 e3 71 21 57 2c 10 	vinsertps xmm2,xmm1,DWORD PTR [rdi+0x2c],0x10
  25df70:	c5 fa 10 4f 20       	vmovss xmm1,DWORD PTR [rdi+0x20]
  25df75:	c4 e3 71 21 4f 24 10 	vinsertps xmm1,xmm1,DWORD PTR [rdi+0x24],0x10
  25df7c:	c5 f0 16 ca          	vmovlhps xmm1,xmm1,xmm2
  25df80:	c5 fd 6f 15 f8 a0 42 	vmovdqa ymm2,YMMWORD PTR [rip+0x42a0f8]        # 688080 <_fini+0xd4c>
  25df87:	00 
  25df88:	c4 e3 75 18 c8 01    	vinsertf128 ymm1,ymm1,xmm0,0x1
  25df8e:	c4 e2 7d 18 07       	vbroadcastss ymm0,DWORD PTR [rdi]
  25df93:	c5 f4 58 0f          	vaddps ymm1,ymm1,YMMWORD PTR [rdi]
  25df97:	c5 fc 58 47 20       	vaddps ymm0,ymm0,YMMWORD PTR [rdi+0x20]
  25df9c:	c4 e3 75 06 d8 21    	vperm2f128 ymm3,ymm1,ymm0,0x21
  25dfa2:	c5 e4 58 d9          	vaddps ymm3,ymm3,ymm1
  25dfa6:	c4 e3 7d 06 c9 21    	vperm2f128 ymm1,ymm0,ymm1,0x21
  25dfac:	c4 e2 75 0c 0d ab a0 	vpermilps ymm1,ymm1,YMMWORD PTR [rip+0x42a0ab]        # 688060 <_fini+0xd2c>
  25dfb3:	42 00 
  25dfb5:	c5 f4 58 c8          	vaddps ymm1,ymm1,ymm0
  25dfb9:	c4 e3 65 06 c1 21    	vperm2f128 ymm0,ymm3,ymm1,0x21
  25dfbf:	c4 e2 6d 36 d1       	vpermd ymm2,ymm2,ymm1
  25dfc4:	c5 e4 c6 c0 4e       	vshufps ymm0,ymm3,ymm0,0x4e
  25dfc9:	c5 fc 58 c3          	vaddps ymm0,ymm0,ymm3
  25dfcd:	c4 e2 5d 36 db       	vpermd ymm3,ymm4,ymm3
  25dfd2:	c4 e3 6d 0c d3 c0    	vblendps ymm2,ymm2,ymm3,0xc0
  25dfd8:	c5 ec 58 c9          	vaddps ymm1,ymm2,ymm1
  25dfdc:	c4 e3 7d 06 c9 21    	vperm2f128 ymm1,ymm0,ymm1,0x21
  25dfe2:	c4 e3 75 0f c8 04    	vpalignr ymm1,ymm1,ymm0,0x4
  25dfe8:	c5 fc 58 c1          	vaddps ymm0,ymm0,ymm1
  25dfec:	c5 f8 77             	vzeroupper 
  25dfef:	c3                   	ret    

It looks better now, but the tree_reduce_16 is much bigger. I wonder if this is because of my simd_shuffle implementation.

@bjorn3
Member

bjorn3 commented Feb 20, 2022

Ah, no, it was because of the CGUs: I assume rustc has ThinLTO enabled by default.

Right, in release mode local-thinlto is enabled by default. This is ThinLTO applied to all codegen units in a single crate, unlike regular ThinLTO which applies to all crates together.

@programmerjake
Member

programmerjake commented Feb 20, 2022

I could be totally wrong, but can't you just call __builtin_shufflevector rather than _mm_shuffle_epi8? Maybe that's why the 16-element one gives terrible results.
assuming the shuffle vector implementation you're currently using is:
https://github.com/rust-lang/rust/blob/6250d5a08cf0870d3655fa98b83718bc01ff6f45/compiler/rustc_codegen_gcc/src/builder.rs#L1439

@workingjubilee
Member Author

workingjubilee commented Feb 21, 2022

Thank you!

To be noted:

  • the current implementation of simd_shuffle is awkward as GCC expects a mask of the same size as the input vectors while LLVM does not.

Ah! Is there any summary somewhere of what GCC exposes in terms of generalized vector logic?

@antoyo

antoyo commented Feb 21, 2022

i could be totally wrong, but can't you just call __builtin_shufflevector rather than _mm_shuffle_epi8? maybe that's why the 16-element one gives terrible results.
assuming the shuffle vector implementation you're currently using is:
https://github.com/rust-lang/rust/blob/6250d5a08cf0870d3655fa98b83718bc01ff6f45/compiler/rustc_codegen_gcc/src/builder.rs#L1439

No, the implementation I use is there.
I use the equivalent of __builtin_shuffle which is a front-end intrinsic. In terms of GIMPLE, this is a VEC_PERM_EXPR.
The problem with it is that it requires its mask to be the same size as the input vectors.

Edit: I just realized you mentioned __builtin_shufflevector and not __builtin_shuffle, which does not seem to have the same limitation in terms of length. I'll try this one. Thanks!

Edit 2: It seems __builtin_shufflevector is also built on VEC_PERM_EXPR, so it pads the elements, similarly to what I do. It seems the elements are padded differently, so perhaps I should do it the same way.

@antoyo

antoyo commented Feb 21, 2022

Thank you!

To be noted:

  • the current implementation of simd_shuffle is awkward as GCC expects a mask of the same size as the input vectors while LLVM does not.

Ah! Is there any summary somewhere of what GCC exposes in terms of generalized vector logic?

Here it is.

@workingjubilee
Member Author

Reading that, I'm curious, because it doesn't specify one way or the other, and I would like to believe the naive possibility: does __builtin_shuffle accept non-constant shuffle masks? These certainly seem to emit well-behaved "dynamic" (register-operand) shuffles when I look at the assembly:

typedef int v4si __attribute__ ((vector_size (16)));

v4si shuffle1(v4si a, v4si mask) {
    return __builtin_shuffle (a, mask);
}

v4si shuffle2(v4si a, v4si b, v4si mask) {
    return __builtin_shuffle (a, b, mask);
}

@antoyo

antoyo commented Feb 23, 2022

It seems to accept non-constant shuffle masks.
I do remember having seen some intrinsics requiring constant arguments, but I don't remember which one it is and it might not be related to shuffle.

@programmerjake
Member

__builtin_shufflevector is the one with const arguments... it comes from Clang, probably because LLVM's shufflevector instruction requires a constant mask.

@antoyo

antoyo commented Mar 20, 2022

I fixed the vector shuffle in rustc_codegen_gcc, and I realized I was also missing the GCC CLI argument -mavx512f, which makes the result much better. If you happen to know any other GCC flags I could be missing, please mention them!
Here's the new result:

First
0000000000267860 <_RNvCs49eEDcGqTSg_12test_backend13tree_reduce_1>:
  267860:	c5 fa 10 07          	vmovss xmm0,DWORD PTR [rdi]
  267864:	c3                   	ret    
  267865:	66 2e 0f 1f 84 00 00 	cs nop WORD PTR [rax+rax*1+0x0]
  26786c:	00 00 00 
  26786f:	90                   	nop

0000000000267870 <_RNvCs49eEDcGqTSg_12test_backend13tree_reduce_2>:
  267870:	c5 fa 7e 0f          	vmovq  xmm1,QWORD PTR [rdi]
  267874:	c5 fa 16 c1          	vmovshdup xmm0,xmm1
  267878:	c5 f8 58 c1          	vaddps xmm0,xmm0,xmm1
  26787c:	c3                   	ret    
  26787d:	0f 1f 00             	nop    DWORD PTR [rax]

0000000000267880 <_RNvCs49eEDcGqTSg_12test_backend13tree_reduce_4>:
  267880:	c4 e3 79 04 0f f5    	vpermilps xmm1,XMMWORD PTR [rdi],0xf5
  267886:	c5 f0 58 0f          	vaddps xmm1,xmm1,XMMWORD PTR [rdi]
  26788a:	c4 e3 79 04 c1 aa    	vpermilps xmm0,xmm1,0xaa
  267890:	c5 f8 58 c1          	vaddps xmm0,xmm0,xmm1
  267894:	c3                   	ret    
  267895:	66 2e 0f 1f 84 00 00 	cs nop WORD PTR [rax+rax*1+0x0]
  26789c:	00 00 00 
  26789f:	90                   	nop

00000000002678a0 <_RNvCs49eEDcGqTSg_12test_backend13tree_reduce_8>:
  2678a0:	c4 e3 7d 04 07 f5    	vpermilps ymm0,YMMWORD PTR [rdi],0xf5
  2678a6:	c5 fc 58 07          	vaddps ymm0,ymm0,YMMWORD PTR [rdi]
  2678aa:	c4 e3 7d 04 c8 aa    	vpermilps ymm1,ymm0,0xaa
  2678b0:	c5 f4 58 c8          	vaddps ymm1,ymm1,ymm0
  2678b4:	c4 e3 7d 04 c1 00    	vpermilps ymm0,ymm1,0x0
  2678ba:	c4 e3 7d 06 c0 11    	vperm2f128 ymm0,ymm0,ymm0,0x11
  2678c0:	c5 fc 58 c1          	vaddps ymm0,ymm0,ymm1
  2678c4:	c5 f8 77             	vzeroupper 
  2678c7:	c3                   	ret    
  2678c8:	0f 1f 84 00 00 00 00 	nop    DWORD PTR [rax+rax*1+0x0]
  2678cf:	00 

00000000002678d0 <_RNvCs49eEDcGqTSg_12test_backend14tree_reduce_16>:
  2678d0:	62 f3 7d 48 04 0f f5 	vpermilps zmm1,ZMMWORD PTR [rdi],0xf5
  2678d7:	62 f1 74 48 58 0f    	vaddps zmm1,zmm1,ZMMWORD PTR [rdi]
  2678dd:	b8 08 00 00 00       	mov    eax,0x8
  2678e2:	62 f3 7d 48 04 c1 aa 	vpermilps zmm0,zmm1,0xaa
  2678e9:	62 f1 7c 48 58 c1    	vaddps zmm0,zmm0,zmm1
  2678ef:	62 f1 7d 48 6f 0d 07 	vmovdqa32 zmm1,ZMMWORD PTR [rip+0x3ef707]        # 657000 <_fini+0xb88>
  2678f6:	f7 3e 00 
  2678f9:	62 f2 75 48 16 c8    	vpermps zmm1,zmm1,zmm0
  2678ff:	62 f1 74 48 58 c8    	vaddps zmm1,zmm1,zmm0
  267905:	62 f2 7d 48 7c c0    	vpbroadcastd zmm0,eax
  26790b:	62 f2 7d 48 16 c1    	vpermps zmm0,zmm0,zmm1
  267911:	62 f1 7c 48 58 c1    	vaddps zmm0,zmm0,zmm1
  267917:	c5 f8 77             	vzeroupper 
  26791a:	c3                   	ret    
Second
0000000000267860 <_RNvCs49eEDcGqTSg_12test_backend13tree_reduce_1>:
  267860:	c5 fa 10 07          	vmovss xmm0,DWORD PTR [rdi]
  267864:	c3                   	ret    
  267865:	66 2e 0f 1f 84 00 00 	cs nop WORD PTR [rax+rax*1+0x0]
  26786c:	00 00 00 
  26786f:	90                   	nop

0000000000267870 <_RNvCs49eEDcGqTSg_12test_backend13tree_reduce_2>:
  267870:	c5 fa 7e 0f          	vmovq  xmm1,QWORD PTR [rdi]
  267874:	c5 f0 c6 c1 e1       	vshufps xmm0,xmm1,xmm1,0xe1
  267879:	c5 f8 58 c1          	vaddps xmm0,xmm0,xmm1
  26787d:	c3                   	ret    
  26787e:	66 90                	xchg   ax,ax

0000000000267880 <_RNvCs49eEDcGqTSg_12test_backend13tree_reduce_4>:
  267880:	c4 e3 79 04 0f 0e    	vpermilps xmm1,XMMWORD PTR [rdi],0xe
  267886:	c5 f0 58 0f          	vaddps xmm1,xmm1,XMMWORD PTR [rdi]
  26788a:	c4 e3 79 04 c1 39    	vpermilps xmm0,xmm1,0x39
  267890:	c5 f8 58 c1          	vaddps xmm0,xmm0,xmm1
  267894:	c3                   	ret    
  267895:	66 2e 0f 1f 84 00 00 	cs nop WORD PTR [rax+rax*1+0x0]
  26789c:	00 00 00 
  26789f:	90                   	nop

00000000002678a0 <_RNvCs49eEDcGqTSg_12test_backend13tree_reduce_8>:
  2678a0:	c5 fd 6f 0d 58 f7 3e 	vmovdqa ymm1,YMMWORD PTR [rip+0x3ef758]        # 657000 <_fini+0xb60>
  2678a7:	00 
  2678a8:	c5 fc 28 07          	vmovaps ymm0,YMMWORD PTR [rdi]
  2678ac:	c4 e2 75 36 c8       	vpermd ymm1,ymm1,ymm0
  2678b1:	c5 fc 58 c1          	vaddps ymm0,ymm0,ymm1
  2678b5:	c5 fd 6f 0d 63 f7 3e 	vmovdqa ymm1,YMMWORD PTR [rip+0x3ef763]        # 657020 <_fini+0xb80>
  2678bc:	00 
  2678bd:	c4 e2 75 36 c8       	vpermd ymm1,ymm1,ymm0
  2678c2:	c5 fc 58 c1          	vaddps ymm0,ymm0,ymm1
  2678c6:	c5 fd 6f 0d 72 f7 3e 	vmovdqa ymm1,YMMWORD PTR [rip+0x3ef772]        # 657040 <_fini+0xba0>
  2678cd:	00 
  2678ce:	c4 e2 75 36 c8       	vpermd ymm1,ymm1,ymm0
  2678d3:	c5 fc 58 c1          	vaddps ymm0,ymm0,ymm1
  2678d7:	c5 f8 77             	vzeroupper 
  2678da:	c3                   	ret    
  2678db:	0f 1f 44 00 00       	nop    DWORD PTR [rax+rax*1+0x0]

00000000002678e0 <_RNvCs49eEDcGqTSg_12test_backend14tree_reduce_16>:
  2678e0:	62 f1 7d 48 6f 0d 96 	vmovdqa32 zmm1,ZMMWORD PTR [rip+0x3ef796]        # 657080 <_fini+0xbe0>
  2678e7:	f7 3e 00 
  2678ea:	62 f1 7c 48 28 07    	vmovaps zmm0,ZMMWORD PTR [rdi]
  2678f0:	62 f2 75 48 16 c8    	vpermps zmm1,zmm1,zmm0
  2678f6:	62 f1 74 48 58 c8    	vaddps zmm1,zmm1,zmm0
  2678fc:	62 f1 7d 48 6f 05 ba 	vmovdqa32 zmm0,ZMMWORD PTR [rip+0x3ef7ba]        # 6570c0 <_fini+0xc20>
  267903:	f7 3e 00 
  267906:	62 f2 7d 48 16 c1    	vpermps zmm0,zmm0,zmm1
  26790c:	62 f1 7c 48 58 c1    	vaddps zmm0,zmm0,zmm1
  267912:	62 f1 7d 48 6f 0d e4 	vmovdqa32 zmm1,ZMMWORD PTR [rip+0x3ef7e4]        # 657100 <_fini+0xc60>
  267919:	f7 3e 00 
  26791c:	62 f2 75 48 16 c8    	vpermps zmm1,zmm1,zmm0
  267922:	62 f1 74 48 58 c8    	vaddps zmm1,zmm1,zmm0
  267928:	62 f1 7d 48 6f 05 0e 	vmovdqa32 zmm0,ZMMWORD PTR [rip+0x3ef80e]        # 657140 <_fini+0xca0>
  26792f:	f8 3e 00 
  267932:	62 f2 7d 48 16 c1    	vpermps zmm0,zmm0,zmm1
  267938:	62 f1 7c 48 58 c1    	vaddps zmm0,zmm0,zmm1
  26793e:	c5 f8 77             	vzeroupper 
  267941:	c3                   	ret    

@pervognsen

pervognsen commented Jul 28, 2023

I just started experimenting with std::simd, and after studying the quality of the generated code it became apparent that reduce's left-associative ordering is the biggest issue right now for efficiently implementing commonplace operations like dot products. I ended up writing a generic reduce_symmetric function to do the job:

use std::simd::{LaneCount, Simd, SimdElement, SupportedLaneCount};

pub fn reduce_symmetric<T, const N: usize>(
    a: Simd<T, N>,
    mut f: impl FnMut(Simd<T, N>, Simd<T, N>) -> Simd<T, N>,
) -> T
where
    LaneCount<N>: SupportedLaneCount,
    T: SimdElement,
{
    assert!(N.is_power_of_two());
    // Each round combines lane i with lane i + N/2 (via interleave), halving the
    // number of distinct partial results; after log2(N) rounds every lane, and in
    // particular lane 0, holds the full reduction.
    (0..N.ilog2()).fold(a, |a, _| {
        let (lo, hi) = a.interleave(a);
        f(lo, hi)
    })[0]
}
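
As a hedged sketch of the dot-product use case mentioned above (the chunk width and function name are mine, not from the original comment, and it assumes the same imports as the snippet above), the accumulation stays vertical and reduce_symmetric runs only once at the end:

fn dot(a: &[f32], b: &[f32]) -> f32 {
    const LANES: usize = 8;
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len() % LANES, 0); // remainder handling omitted for brevity
    let mut acc = Simd::<f32, LANES>::splat(0.0);
    for (ca, cb) in a.chunks_exact(LANES).zip(b.chunks_exact(LANES)) {
        acc += Simd::from_slice(ca) * Simd::from_slice(cb); // vertical ops only
    }
    // a single symmetric (tree-ordered) horizontal reduction at the very end
    reduce_symmetric(acc, |x, y| x + y)
}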

I have a bunch of side-by-side studies of dot products here that might be interesting: https://godbolt.org/z/xzhnr1K83

The issue with ordered reduction becomes greatly aggravated as the vectors get wider, especially in cases when you want to use f32x32 or f32x64 to engage more parallel vector registers to overcome latency (which also lets you take advantage of AVX512 while lowering to multiple vectors for narrower-width vector ISAs like SSE, NEON and AVX2). For example, if f32x32 gets lowered to 8x f32x4 on SSE and NEON targets, then to do a reduce_sum on a f32x32 forces eight separate horizontal reductions (and each horizontal reduction is internally suboptimal) rather than only one horizontal reduction at the very end when doing the final reduction from f32x4 to f32.
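
A sketch of the lowering contrast being described (widths and names are illustrative, not from the comment): reducing a "wide" value represented as 8 × f32x4 can stay vertical until a single final horizontal step, instead of doing eight separate horizontal reductions:

#![feature(portable_simd)]
use std::simd::f32x4;

pub fn reduce_wide(parts: [f32x4; 8]) -> f32 {
    // Vertical adds collapse the eight sub-vectors into a single f32x4
    // (three tree levels, no shuffles needed)...
    let v = ((parts[0] + parts[1]) + (parts[2] + parts[3]))
        + ((parts[4] + parts[5]) + (parts[6] + parts[7]));
    // ...followed by exactly one vector-to-scalar (horizontal) step at the end.
    (v[0] + v[2]) + (v[1] + v[3])
}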

Because of this, I believe the API should be restructured around the performance principle of least surprise. As a concrete suggestion: reduce_op should use the well-defined power-of-two factorization (the interleave variant so wider-than-native vector lowering can exploit vertical adds for reduction) and reduce_op_ordered would then offer the current left-chained semantics. In terms of the least surprise principle, the main issues I can see right now with this suggestion:

  1. Someone is expecting bit-exact agreement with the left fold semantics of f32::sum. This could violate least-surprise on a semantics level, but someone who cares about that level of detail should read the docs, and I think calling it reduce_sum rather than sum makes it apparent that it's something different from what you're used to in std.
  2. When lowering vectors to scalars, the symmetric reduction entails much higher register pressure in cases where the operands can be loaded directly from memory or otherwise synthesized locally (if all the scalar components are already in registers, e.g. from an earlier loop, then they can be folded in place). I don't know how much of a real concern this is, or how much weight is given or should be given to the performance of vector-to-scalar lowering. My personal opinion is that f32x4 should be treated as the lower bound for lowering when considering "performance portability" for a portable SIMD library. For example, f32x32 can lower to 2x f32x16 (AVX512), 4x f32x8 (AVX) or 8x f32x4 (SSE, NEON) while fitting very comfortably within those ISA's register limits (16 registers on the most constrained target, x86, so 8 registers leaves room for temporaries/auxiliaries).

Footnote: I initially considered the deinterleave variant of symmetric reduction which has the "advantage" of only reassociating rather than commuting terms. However, it does not play well when lowering from wider-than-native vectors, e.g. for f32x8 on a f32x4 target where a f32x8 value is represented by a pair of f32x4 values we want to be able to do a vertical (vector-to-vector) reduction of the pair and then a single horizontal (vector-to-scalar) reduction. The deinterleave variant would perform a horizontal reduction on each of the vectors and then add the resulting scalars.

Thoughts?

@calebzulawski
Member

I'm okay with renaming to reduce_sum--I don't have a strong preference either way. I believe some time ago that was the original name.

I don't have a strong opinion on the actual association order, but I don't think left-to-right makes much sense other than being predictable. Maybe there are some unique cases where that's necessary.

One question we need to answer is whether the reductions need to be the same on every architecture. IEEE 754 hints that they should be numerically stable (it specifically says they should avoid underflow or overflow) but leaves it implementation defined. That leaves the decision to us, to determine what we consider "portable".

If we decide to make it consistent across architectures, I think we should still leave the exact implementation unspecified. In the future we may decide the specific algorithm performs poorly on some architecture, and we devise a new one. If a user really needs to know the exact algorithm, they should probably implement their own.

One option would be an API like the following (I can't remember if you can have default const generics)

enum ReduceOrder {
    Stable,
    Sequence,
    Any,
}

fn reduce_sum<const ORDER: ReduceOrder = Stable>(self) { ... }

@pervognsen

pervognsen commented Jul 28, 2023

The IEEE-754-2008 standardese on vector reductions is indeed left purposely vague to the point where I believe they don't say anything about precision at all. I'm not sure if taking advantage of that extreme leeway is advised as the default behavior, but at least it gives enough elbow room that any reasonable solution will meet the requirements, including what we're discussing.

I can see the argument for leaving reduction order unspecified for implementation flexibility. On the other hand, there is also a strong argument for making the "default" APIs provide well-specified behavior so you can achieve reproducibility across the major architectures for the fundamental core of the API. That's possible with IEEE-754-2008 (modulo NaN bit pattern propagation and the details around pre-rounding vs post-rounding tininess detection for FMA, but you can avoid FMA if that's an issue) and seems like a very worthwhile goal to me, even if it cannot practically be extended to every last part of the API (e.g. algebraic and transcendental functions).

Making it consistent across architectures but not across standard library versions is definitely better than leaving it totally unspecified and might be a sufficient compromise. Worst case, anyone who wants a specific ordering that is long-term stable can always ask for it explicitly by picking what function variants they call.

@workingjubilee
Member Author

If we decide to make it consistent across architectures, I think we should still leave the exact implementation unspecified. In the future we may decide the specific algorithm performs poorly on some architecture, and we devise a new one. If a user really needs to know the exact algorithm, they should probably implement their own.

Ah, I remember the discussion we had about this, and I think I had misunderstood your desire to leave it underspecified as being for "let's vary it from architecture to architecture" reasons rather than "let's allow the freedom to vary it from version to version" reasons... Obviously, if we have something with better average performance on most architectures, we should consider it, so I understand leaving that unstable for longer. If we have consensus on using a consistent algorithm from architecture to architecture, then I agree we should do that.

@programmerjake
Member

programmerjake commented Jul 28, 2023

as mentioned previously, I proposed using a tree-reduction with power-of-2 semantics since that's reasonably efficient on most architectures and SVP64 has a specific set of instructions for that exact pattern.

algorithm used for SVP64

simplified algorithm:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=9672136e5c067b83e831859cb9013002

pub fn tree_reduce_SVP64<T>(
    mut input: Vec<T>,
    mut f: impl FnMut(&T, &T) -> T,
) -> Option<T> {
    let mut step = 1;
    while step < input.len() {
        step *= 2;
        for i in (0..input.len()).step_by(step) {
            let other = i + step / 2;
            if other < input.len() {
                let v = f(&input[i], &input[other]);
                input[i] = v;
            }
        }
    }
    input.into_iter().next()
}

#[test]
fn test() {
    fn run(len: usize) -> String {
        let mut input = vec![];
        for i in 0..len {
            input.push(format!("x{i}"));
        }
        tree_reduce_SVP64(input, |l, r| format!("({l} + {r})")).unwrap_or_default()
    }
    assert_eq!(run(0), "");
    assert_eq!(run(1), "x0");
    assert_eq!(run(2), "(x0 + x1)");
    assert_eq!(run(3), "((x0 + x1) + x2)");
    assert_eq!(run(4), "((x0 + x1) + (x2 + x3))");
    assert_eq!(run(5), "(((x0 + x1) + (x2 + x3)) + x4)");
    assert_eq!(run(6), "(((x0 + x1) + (x2 + x3)) + (x4 + x5))");
    assert_eq!(run(7), "(((x0 + x1) + (x2 + x3)) + ((x4 + x5) + x6))");
    assert_eq!(run(8), "(((x0 + x1) + (x2 + x3)) + ((x4 + x5) + (x6 + x7)))");
}

@pervognsen
Copy link

pervognsen commented Jul 29, 2023

Right, that tree reduction is what I referred to as the deinterleave reduction in my footnote. It can be a tiny bit better than the interleave reduction on some ISAs for native-width vector reduction (on NEON, interleave is better even in the native-width case), but in general you get an increasingly serious code quality issue when lowering from wider-than-native vector types. IMHO that kind of lowering should be a top priority for a portable SIMD library, since you want to avoid pushing people towards target-specific cfg flags to tune vector widths if at all possible.
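
For concreteness, here is a rough sketch of the two association orders on an f32x8, spelled with std::simd swizzles (the function names and this particular spelling are mine, not taken from the godbolt link):

#![feature(portable_simd)]
use std::simd::{f32x8, simd_swizzle};

// Deinterleave ("adjacent pairs") order:
// ((x0 + x1) + (x2 + x3)) + ((x4 + x5) + (x6 + x7))
pub fn reduce_sum_deinterleave(v: f32x8) -> f32 {
    // Split into even and odd lanes, add, repeat.
    let s = simd_swizzle!(v, [0, 2, 4, 6]) + simd_swizzle!(v, [1, 3, 5, 7]);
    let s = simd_swizzle!(s, [0, 2]) + simd_swizzle!(s, [1, 3]);
    s[0] + s[1]
}

// Interleave ("fold high half onto low half") order:
// ((x0 + x4) + (x2 + x6)) + ((x1 + x5) + (x3 + x7))
pub fn reduce_sum_interleave(v: f32x8) -> f32 {
    // Split into low and high halves, add, repeat.
    let s = simd_swizzle!(v, [0, 1, 2, 3]) + simd_swizzle!(v, [4, 5, 6, 7]);
    let s = simd_swizzle!(s, [0, 1]) + simd_swizzle!(s, [2, 3]);
    s[0] + s[1]
}

The interleave version tends to lower to a plain "extract high half and add" sequence at native width, which is part of why it tends to be the cheaper of the two.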

Here are some side-by-side comparisons of interleave vs deinterleave sum reduction for f32x4, f32x8, f32x16, f32x32 and f32x64 on SSE, AVX, AVX512 and NEON: https://godbolt.org/z/jqTKzr3hx

As you can see [1], the code generated for the deinterleave sums gets more and more problematic, to the point where reduce_sum_deinterleave_f32x64 on SSE generates a 6-iteration loop with a massive body since the compiler has had enough.

I don't have any first-hand knowledge of SVP64, but I would be a little surprised (maybe you could explain more?) if there isn't an equally or almost equally fast way to do interleaves on SVP64 as deinterleaves since they're two of the fundamental length-parametric permutations on power-of-two vector pairs and constantly show up in applications.

[1] Note that just looking at the code size for the lowered deinterleaves on AVX2 and AVX512 is misleading, since the generated code relies heavily on the hadd (horizontal add) instruction. Despite being a single instruction, hadd is costly in both latency and throughput, roughly comparable to an equivalent instruction sequence (it is microsequenced or microcoded on at least some uarches) but with less instruction scheduling freedom, which AFAIK can sometimes be a net negative.

@programmerjake
Copy link
Member

I don't have any first-hand knowledge of SVP64, but I would be a little surprised (maybe you could explain more?) if there isn't an equally or almost equally fast way to do interleaves on SVP64 as deinterleaves since they're two of the fundamental length-parametric permutations on power-of-two vector pairs and constantly show up in applications.

There are fast-ish interleave/deinterleave ops; however, the reduction algorithm I mentioned is built into the hardware, such that any reduction is two setup instructions plus one vector instruction that runs the full reduction algorithm -- making it substantially faster than the separate vector instructions that would be needed for each stage of the reduction. That works for all supported vector lengths, which for the current draft version of SVP64 is 0-64 elements, including non-powers-of-2.

assembly to fadd-reduce the f64x17 vector in registers f4,f5,f6,f7,...f20:

# set SVSHAPE* registers to tree-reduction mode for 17 elements
svshape 17, 1, 1, 7, 0
# enable remapping element numbers according to SVSHAPE* for next sv.* instruction
svremap 0x1F, 0, 1, 0, 0, 0, 0
# run the vector reduction, result in f4
sv.fadd *f4, *f4, *f4

@thomcc
Copy link
Member

thomcc commented Jul 30, 2023

in general you have an increasingly serious code quality issue when you lower from wider-than-native vector types, and IMHO that kind of lowering should be a top priority for a portable SIMD library since you want to avoid people being pushed towards target-specific cfg flags to tune vector widths if at all possible.

This is a pretty compelling argument to me, even if it means things are slightly less efficient on some architectures.

@programmerjake
Copy link
Member

programmerjake commented Jul 30, 2023

Currently LLVM tends to have code quality issues when the vector length is substantially larger than native, regardless of which algorithm is being looked at (this is not specific to reductions), so reductions having bad codegen at a vector length 16 times larger than native is IMHO not really a problem caused by the specific Simd reduction algorithm chosen. IMHO the user should manually split their input data into reasonably sized chunks (probably f32x16 at most for x86) and then use Simd on one chunk at a time.
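
A minimal sketch of that chunking pattern, assuming a fixed f32x16 chunk width (the function name is mine, and the std::simd module paths are as of a recent nightly):

#![feature(portable_simd)]
use std::simd::{f32x16, num::SimdFloat};

// Sum a slice one fixed-width chunk at a time. The chunk width (16 here),
// not the native vector width, determines the association pattern inside
// each chunk; the per-chunk results are then accumulated sequentially.
pub fn chunked_sum(data: &[f32]) -> f32 {
    let mut total = 0.0f32;
    let mut chunks = data.chunks_exact(16);
    for chunk in &mut chunks {
        total += f32x16::from_slice(chunk).reduce_sum();
    }
    // Handle the leftover elements (fewer than 16) with a scalar loop.
    total + chunks.remainder().iter().sum::<f32>()
}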

@pervognsen
Copy link

pervognsen commented Jul 30, 2023

so reduction having bad codegen when using a vector length 16 times larger than native is imho not really a problem caused by the specific reduction algorithm chosen

Yeah, the f32x64 case was an investigation of the extreme case since that's currently the highest lane count available. The f32x32 and f32x16 cases are more practical due to the lower induced register pressure from register splitting.

I don't have enough experience yet with LLVM in this specific area to weigh in on code quality issues with width lowering more generally, but in the specific case we're discussing there is nothing LLVM or even a human could do, given how the reduction order is specified, so it does seem worth singling out. And if the code quality issues with width lowering really are that severe in general, it becomes very hard to use something like std::simd for widths beyond a common-denominator f32x4 if it means suffering severe regressions on native f32x4 targets.

Has anyone involved with the project tried to characterize these codegen issues systematically to guide these sorts of decisions? Otherwise, as I continue kicking the tires of std::simd I'll try to gather some data.

@HadrienG2
Copy link

HadrienG2 commented Jul 30, 2023

I think this sort of issue is best resolved by having some kind of API that tells you what is the optimal vector width on the active hardware.

I once asked for something like this to be added to std::simd, but it was deemed out of scope. However, the target_features crate provides such a const fn nowadays.
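
For illustration, here is a hand-rolled compile-time approximation of the same idea using only cfg(target_feature) (this is not the target_features crate's actual API, just a sketch of the shape of the answer on x86):

// Suggested f32 lane count derived from the target features the binary was
// compiled with. Purely illustrative; real code would cover more targets.
#[cfg(target_feature = "avx512f")]
pub const SUGGESTED_F32_LANES: usize = 16;
#[cfg(all(target_feature = "avx", not(target_feature = "avx512f")))]
pub const SUGGESTED_F32_LANES: usize = 8;
#[cfg(not(any(target_feature = "avx", target_feature = "avx512f")))]
pub const SUGGESTED_F32_LANES: usize = 4;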

@pervognsen
Copy link

pervognsen commented Jul 30, 2023

there are fast-ish interleave/deinterleave ops, however the reduction algorithm I mentioned is built into the hardware such that any reduction is two setup instructions plus one vector instruction that runs that full reduction algorithm

Thanks, that's the part I had missed. It might also be worth looking into what affordances and preferences SVE and RISC-V's vector extension have when it comes to reductions since these all seem in the general bucket of scalable vector ISAs that hearken back more to Cray-style vectors than traditional SIMD ISAs.

@pervognsen
Copy link

pervognsen commented Jul 30, 2023

I think this sort of issue is best resolved by having some kind of API that tells you what is the optimal vector width on the active hardware.

Offering such an API might be helpful but it has some downsides:

  1. Numerical results become hardware-dependent (specifically, dependent on the queried vector width) for fast reductions and other algorithms that rely on width-dependent reassociation and commutation for vector-level and instruction-level parallelism. Although, as @programmerjake mentions, it probably makes sense to define the chunk width separately from the vector width, so that the chunk width determines the specified behavior of the operation while the per-chunk processing is computed invariantly with vectors, and you let the compiler make the best of it. I'll try some more experiments with that (the dot_f32x16x8 function in the godbolt link in my first post is an example) and see what the codegen looks like.
  2. It seems really annoying to consume variable vector widths with runtime queries. You need to make your std::simd-based code use const generics for the lane count (that part is not too bad), but now you also need to pre-instantiate all possible (or at least relevant) lane counts at compile time so you can dispatch to the right version at runtime; see the sketch after this list. That negatively impacts compile times and code size. And you probably need some kind of semi-automated multiversioning runtime dispatch for every such function to make it ergonomically tolerable (when a lane-count-specialized function directly calls another lane-count-specialized function, you can do a direct call to elide the dispatch).
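
To make the multiversioning point concrete, here is a rough sketch of what the per-lane-count instantiation and dispatch end up looking like (the names and the feature-to-lane-count mapping are mine and only illustrative):

#![feature(portable_simd)]
use std::simd::{num::SimdFloat, LaneCount, Simd, SupportedLaneCount};

// One monomorphization per lane count; the association order (and hence the
// floating-point result) depends on N.
fn sum_lanes<const N: usize>(data: &[f32]) -> f32
where
    LaneCount<N>: SupportedLaneCount,
{
    let mut acc = Simd::<f32, N>::splat(0.0);
    let mut chunks = data.chunks_exact(N);
    for chunk in &mut chunks {
        acc += Simd::from_slice(chunk);
    }
    acc.reduce_sum() + chunks.remainder().iter().sum::<f32>()
}

// Every lane count that might be selected at runtime has to be instantiated
// at compile time, and every entry point needs a dispatch like this. (To
// actually get AVX2/AVX-512 codegen inside sum_lanes you would additionally
// need #[target_feature]-style multiversioning; this only shows the shape of
// the dispatch.)
#[cfg(target_arch = "x86_64")]
pub fn sum(data: &[f32]) -> f32 {
    if is_x86_feature_detected!("avx512f") {
        sum_lanes::<16>(data)
    } else if is_x86_feature_detected!("avx2") {
        sum_lanes::<8>(data)
    } else {
        sum_lanes::<4>(data)
    }
}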

Edit: I see you mentioned a const fn, so I guess you meant it would query the vector width on the machine where the program was compiled. That only works for programs that are compiled from source on the end user's own machine. At least for the kind of software I mostly work on, that's not how we distribute software to end users. And I think that once you start baking in hardware vector widths like that, it loses a lot of its appeal to me as a portable SIMD solution.

@rachtsingh
Copy link

I wanted to ask here rather than open a new issue, since I've been trying to make some dot products fast in Rust for machine learning purposes: is this issue blocked on something at the moment? This might be naive, but it seems like there are some great solutions proposed above, and maybe this is in the realm of "perfect is the enemy of good"? At least, I think the machine learning use case doesn't need stability in the underlying reduction algorithm (I understand that there are users who care more).

For the small amount it's worth, I mostly work on software that's compiled on the author's machine and distributed as binaries, so I agree with the point above that a const fn vector-width query isn't ideal.

I'm happy to run codegen experiments on some variety of hardware if that's helpful.
