Investigate pairwise sum/product flops for reductions/horizontal ops #235
proposed tree reduction algorithm (compatible with SimpleV's tree reduction algorithm):

#![feature(portable_simd)]
use std::simd::{Simd, LaneCount, SupportedLaneCount, Swizzle};
const fn make_swizzle<const N: usize>(step: usize) -> [usize; N] {
let mut retval = [0; N];
let mut i = 0;
while i < N {
retval[i] = i - i % step + step / 2;
if retval[i] >= N {
retval[i] = 0;
}
i += 1;
}
retval
}
struct S<const N: usize, const STEP: usize>;
impl<const N: usize, const STEP: usize> Swizzle<N, N> for S<N, STEP> {
const INDEX: [usize; N] = make_swizzle::<N>(STEP);
}
pub fn tree_reduce<const N: usize>(mut vec: Simd<f32, N>) -> f32
where
LaneCount<N>: SupportedLaneCount,
{
macro_rules! step {
($step:literal) => {
if $step < N * 2 {
vec += S::<N, $step>::swizzle(vec);
}
};
}
step!(2);
step!(4);
step!(8);
step!(16);
step!(32);
step!(64);
vec[0]
}
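For reference, labels like tree_reduce_4 / tree_reduce_8 / tree_reduce_16 in the disassembly quoted later in the thread correspond to monomorphized wrappers along these lines (a sketch of a plausible test harness reusing the tree_reduce above; the exact harness isn't shown in the thread):

// Hypothetical per-width entry points; not code from the thread, just how
// tree_reduce would be instantiated to get one symbol per lane count.
pub fn tree_reduce_4(v: Simd<f32, 4>) -> f32 { tree_reduce(v) }
pub fn tree_reduce_8(v: Simd<f32, 8>) -> f32 { tree_reduce(v) }
pub fn tree_reduce_16(v: Simd<f32, 16>) -> f32 { tree_reduce(v) }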
Wow,

pub fn horizontal_sum(x: Simd<f32, $N>) -> f32 {
    x.horizontal_sum()
}

completely scalarizes (here's the 8-lane instantiation):

example::horizontal_sum_8:
vxorps xmm0, xmm0, xmm0
vaddss xmm0, xmm0, dword ptr [rdi]
vaddss xmm0, xmm0, dword ptr [rdi + 4]
vaddss xmm0, xmm0, dword ptr [rdi + 8]
vaddss xmm0, xmm0, dword ptr [rdi + 12]
vaddss xmm0, xmm0, dword ptr [rdi + 16]
vaddss xmm0, xmm0, dword ptr [rdi + 20]
vaddss xmm0, xmm0, dword ptr [rdi + 24]
vaddss xmm0, xmm0, dword ptr [rdi + 28]
ret
This doesn’t seem to work as well on AArch64, where it only makes use of
EDIT: This behavior gets worse when the ISA supports SVE(2), where it should be happy to use
It looks better if we reverse the order of the shuffles (maybe I can convince Libre-SOC to change SimpleV to match, if this is a sufficiently good motivation):

#![feature(portable_simd)]
use std::simd::{Simd, LaneCount, SupportedLaneCount, Swizzle};
const fn make_swizzle<const N: usize>(step: usize) -> [usize; N] {
let mut retval = [0; N];
let mut i = 0;
while i < N {
retval[i] = i + step / 2;
if retval[i] >= N {
retval[i] = 0;
}
i += 1;
}
retval
}
struct S<const N: usize, const STEP: usize>;
impl<const N: usize, const STEP: usize> Swizzle<N, N> for S<N, STEP> {
const INDEX: [usize; N] = make_swizzle::<N>(STEP);
}
pub fn tree_reduce<const N: usize>(mut vec: Simd<f32, N>) -> f32
where
LaneCount<N>: SupportedLaneCount,
{
macro_rules! step {
($step:literal) => {
if $step < N * 2 {
vec += S::<N, $step>::swizzle(vec);
}
};
}
step!(64);
step!(32);
step!(16);
step!(8);
step!(4);
step!(2);
vec[0]
}
referenced on https://bugs.libre-soc.org/show_bug.cgi?id=697#c2
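To see concretely which association the reversed shuffle order produces, here is a small scalar model (my own sketch, not code from the thread), in the same spirit as the string-based test further down; for 8 elements it yields the distance-4,2,1 pairing rather than SimpleV's adjacent pairing:

// Scalar model of the reversed-order (halving) reduction; for 8 elements the
// result is "(((x0 + x4) + (x2 + x6)) + ((x1 + x5) + (x3 + x7)))".
fn halving_reduce_model(mut v: Vec<String>) -> Option<String> {
    assert!(v.is_empty() || v.len().is_power_of_two());
    let mut step = v.len();
    while step > 1 {
        for i in 0..step / 2 {
            let combined = format!("({} + {})", v[i], v[i + step / 2]);
            v[i] = combined;
        }
        step /= 2;
    }
    v.into_iter().next()
}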
nice ascii-art diagrams:
turns out Arm's
llvm defaults to the distance=4,2,1 variant... it doesn't support non-power-of-2 vector lengths:
Hm, that does look a lot better.

example::tree_reduce_4:
ldr q0, [x0]
dup v1.2d, v0.d[1]
fadd v0.4s, v0.4s, v1.4s
faddp s0, v0.2s
ret
example::tree_reduce_8:
ldp q1, q0, [x0]
fadd v0.4s, v1.4s, v0.4s
dup v1.2d, v0.d[1]
fadd v0.4s, v0.4s, v1.4s
faddp s0, v0.2s
ret
example::tree_reduce_16:
ldp q0, q1, [x0, #32]
ldp q3, q2, [x0]
fadd v1.4s, v2.4s, v1.4s
fadd v0.4s, v3.4s, v0.4s
fadd v0.4s, v0.4s, v1.4s
dup v1.2d, v0.d[1]
fadd v0.4s, v0.4s, v1.4s
faddp s0, v0.2s
ret

Both versions seem to optimize... less than spectacularly when I pass higher lane counts:

example::tree_reduce_64:
vmovaps zmm0, zmmword ptr [rdi + 64]
vmovaps zmm1, zmmword ptr [rdi + 128]
vaddps zmm2, zmm1, zmmword ptr [rdi]
vaddps zmm1, zmm1, dword ptr [rdi]{1to16}
vaddps zmm0, zmm0, zmmword ptr [rdi + 192]
vaddps zmm1, zmm0, zmm1
vaddps zmm0, zmm2, zmm0
vshuff64x2 zmm1, zmm0, zmm1, 78
vaddps zmm0, zmm0, zmm1
vextractf128 xmm1, ymm0, 1
vaddps xmm0, xmm0, xmm1
vpermilpd xmm1, xmm0, 1
vaddps xmm0, xmm0, xmm1
vmovshdup xmm1, xmm0
vaddss xmm0, xmm0, xmm1
vzeroupper
ret
They’re “better”, but still pretty bad. I would have expected, say,

example::tree_reduce_8:
ldp q1, q0, [x0]
fadd v0.4s, v1.4s, v0.4s
mov d1, v0.d[1]
fadd v0.2s, v0.2s, v1.2s
faddp s0, v0.2s
ret

since it doesn’t need all 128 bits for the second half. And it gets progressively worse the higher you go.
Also, with the two-input vector form of faddp:

example::tree_reduce_16:
ldp q0, q1, [x0, #32]
ldp q2, q3, [x0]
faddp v0.4s, v1.4s, v0.4s
faddp v2.4s, v3.4s, v2.4s
faddp v0.4s, v2.4s, v0.4s
mov d1, v0.d[1]
faddp v0.2s, v1.2s, v0.2s
faddp s0, v0.2s
ret
The extra dup instead of a mov is weird, but one relatively low-cost instruction's difference from a preferable form on the 8-lane version is not actually going to make or break our decisions here, since what we're truly evaluating against is not the ideal hand-rolled assembler: it is an N-instruction serial fadd (or fadda on Arm... but that's not guaranteed to be faster, and it's definitely not more precise). The issues we are encountering here seem to be due to a lack of education on LLVM's part, and we can only do so much to tack against that wind. Jacob seems to have cooked up a variant that uses the "obvious" reduction algorithm we have been expecting would apply (and should allow using faddv; LLVM just doesn't seem to ever emit that...?). It is not ideal either, but it's a fair bit better while still being Rust-defined: https://rust.godbolt.org/z/fxYE3Yvzb
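For reference, here is a sketch of that "obvious" halving reduction, written against more recent std::simd names (simd_swizzle!, the fNxM aliases); this is my reconstruction of the idea, not the code behind the godbolt link:

#![feature(portable_simd)]
use std::simd::{f32x2, f32x4, f32x8, simd_swizzle};

// Halve-and-add: split into low/high halves, add them vertically, repeat.
// This is the shape that should be able to map onto faddp/faddv-style
// sequences on AArch64.
pub fn halving_reduce_8(v: f32x8) -> f32 {
    let lo: f32x4 = simd_swizzle!(v, [0, 1, 2, 3]);
    let hi: f32x4 = simd_swizzle!(v, [4, 5, 6, 7]);
    let v = lo + hi;
    let lo: f32x2 = simd_swizzle!(v, [0, 1]);
    let hi: f32x2 = simd_swizzle!(v, [2, 3]);
    let v = lo + hi;
    v[0] + v[1]
}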
I'm not sure how I can help here, but as I implemented some SIMD stuff in rustc_codegen_gcc, here is what it generates for the two variants above.

First:

000000000025ded0 <_RNvCs1fmkINL2TB9_12test_backend13tree_reduce_1>:
25ded0: 48 83 ec 18 sub rsp,0x18
25ded4: 8b 07 mov eax,DWORD PTR [rdi]
25ded6: 48 8d 15 2b 51 4c 00 lea rdx,[rip+0x4c512b] # 723008 <global_1_test_backend_5d019e9d_cgu_0>
25dedd: 31 f6 xor esi,esi
25dedf: 48 8d 7c 24 08 lea rdi,[rsp+0x8]
25dee4: 89 44 24 08 mov DWORD PTR [rsp+0x8],eax
25dee8: e8 e3 01 00 00 call 25e0d0 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj1_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
25deed: c5 fa 10 00 vmovss xmm0,DWORD PTR [rax]
25def1: 48 83 c4 18 add rsp,0x18
25def5: c3 ret
25def6: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]
25defd: 00 00 00
000000000025df00 <_RNvCs1fmkINL2TB9_12test_backend13tree_reduce_2>:
25df00: 48 83 ec 18 sub rsp,0x18
25df04: c5 fa 7e 0f vmovq xmm1,QWORD PTR [rdi]
25df08: 48 8d 15 f9 50 4c 00 lea rdx,[rip+0x4c50f9] # 723008 <global_1_test_backend_5d019e9d_cgu_0>
25df0f: 31 f6 xor esi,esi
25df11: 48 8d 7c 24 08 lea rdi,[rsp+0x8]
25df16: c5 fa 16 c1 vmovshdup xmm0,xmm1
25df1a: c5 f8 58 c1 vaddps xmm0,xmm0,xmm1
25df1e: c5 f8 13 44 24 08 vmovlps QWORD PTR [rsp+0x8],xmm0
25df24: e8 c7 01 00 00 call 25e0f0 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj2_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
25df29: c5 fa 10 00 vmovss xmm0,DWORD PTR [rax]
25df2d: 48 83 c4 18 add rsp,0x18
25df31: c3 ret
25df32: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]
25df39: 00 00 00
25df3c: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
000000000025df40 <_RNvCs1fmkINL2TB9_12test_backend13tree_reduce_4>:
25df40: 48 83 ec 18 sub rsp,0x18
25df44: c4 e3 79 04 07 f5 vpermilps xmm0,XMMWORD PTR [rdi],0xf5
25df4a: c5 f8 58 07 vaddps xmm0,xmm0,XMMWORD PTR [rdi]
25df4e: 31 f6 xor esi,esi
25df50: 48 8d 15 b1 50 4c 00 lea rdx,[rip+0x4c50b1] # 723008 <global_1_test_backend_5d019e9d_cgu_0>
25df57: 48 89 e7 mov rdi,rsp
25df5a: c4 e3 79 04 c8 aa vpermilps xmm1,xmm0,0xaa
25df60: c5 f0 58 c0 vaddps xmm0,xmm1,xmm0
25df64: c5 f8 29 04 24 vmovaps XMMWORD PTR [rsp],xmm0
25df69: e8 a2 01 00 00 call 25e110 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj4_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
25df6e: c5 fa 10 00 vmovss xmm0,DWORD PTR [rax]
25df72: 48 83 c4 18 add rsp,0x18
25df76: c3 ret
25df77: 66 0f 1f 84 00 00 00 nop WORD PTR [rax+rax*1+0x0]
25df7e: 00 00
000000000025df80 <_RNvCs1fmkINL2TB9_12test_backend13tree_reduce_8>:
25df80: 55 push rbp
25df81: 31 f6 xor esi,esi
25df83: 48 89 e5 mov rbp,rsp
25df86: 48 83 e4 e0 and rsp,0xffffffffffffffe0
25df8a: 48 83 ec 20 sub rsp,0x20
25df8e: c4 e3 7d 04 0f f5 vpermilps ymm1,YMMWORD PTR [rdi],0xf5
25df94: c5 f4 58 0f vaddps ymm1,ymm1,YMMWORD PTR [rdi]
25df98: 48 8d 15 69 50 4c 00 lea rdx,[rip+0x4c5069] # 723008 <global_1_test_backend_5d019e9d_cgu_0>
25df9f: 48 89 e7 mov rdi,rsp
25dfa2: c4 e3 7d 04 c1 aa vpermilps ymm0,ymm1,0xaa
25dfa8: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
25dfac: c4 e3 7d 04 c8 00 vpermilps ymm1,ymm0,0x0
25dfb2: c4 e3 75 06 c9 11 vperm2f128 ymm1,ymm1,ymm1,0x11
25dfb8: c5 f4 58 c0 vaddps ymm0,ymm1,ymm0
25dfbc: c5 fc 29 04 24 vmovaps YMMWORD PTR [rsp],ymm0
25dfc1: c5 f8 77 vzeroupper
25dfc4: e8 67 01 00 00 call 25e130 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj8_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
25dfc9: c5 fa 10 00 vmovss xmm0,DWORD PTR [rax]
25dfcd: c9 leave
25dfce: c3 ret
25dfcf: 90 nop
000000000025dfd0 <_RNvCs1fmkINL2TB9_12test_backend14tree_reduce_16>:
25dfd0: 55 push rbp
25dfd1: 31 f6 xor esi,esi
25dfd3: 48 89 e5 mov rbp,rsp
25dfd6: 48 83 e4 c0 and rsp,0xffffffffffffffc0
25dfda: 48 83 ec 40 sub rsp,0x40
25dfde: c5 fa 10 7f 1c vmovss xmm7,DWORD PTR [rdi+0x1c]
25dfe3: c5 fa 10 5f 14 vmovss xmm3,DWORD PTR [rdi+0x14]
25dfe8: c5 fa 10 77 0c vmovss xmm6,DWORD PTR [rdi+0xc]
25dfed: c5 fa 10 4f 04 vmovss xmm1,DWORD PTR [rdi+0x4]
25dff2: c5 fa 10 6f 3c vmovss xmm5,DWORD PTR [rdi+0x3c]
25dff7: c5 fa 10 57 34 vmovss xmm2,DWORD PTR [rdi+0x34]
25dffc: c5 c0 14 ff vunpcklps xmm7,xmm7,xmm7
25e000: c5 e0 14 db vunpcklps xmm3,xmm3,xmm3
25e004: c5 fa 10 67 2c vmovss xmm4,DWORD PTR [rdi+0x2c]
25e009: c5 c8 14 f6 vunpcklps xmm6,xmm6,xmm6
25e00d: c5 f0 14 c9 vunpcklps xmm1,xmm1,xmm1
25e011: c5 e0 16 df vmovlhps xmm3,xmm3,xmm7
25e015: c5 fa 10 47 24 vmovss xmm0,DWORD PTR [rdi+0x24]
25e01a: c5 f0 16 ce vmovlhps xmm1,xmm1,xmm6
25e01e: c5 d0 14 ed vunpcklps xmm5,xmm5,xmm5
25e022: c5 e8 14 d2 vunpcklps xmm2,xmm2,xmm2
25e026: c5 d8 14 e4 vunpcklps xmm4,xmm4,xmm4
25e02a: c5 e8 16 d5 vmovlhps xmm2,xmm2,xmm5
25e02e: c4 e3 75 18 cb 01 vinsertf128 ymm1,ymm1,xmm3,0x1
25e034: 48 8d 15 cd 4f 4c 00 lea rdx,[rip+0x4c4fcd] # 723008 <global_1_test_backend_5d019e9d_cgu_0>
25e03b: c5 f8 14 c0 vunpcklps xmm0,xmm0,xmm0
25e03f: c5 f4 58 0f vaddps ymm1,ymm1,YMMWORD PTR [rdi]
25e043: c5 f8 16 c4 vmovlhps xmm0,xmm0,xmm4
25e047: c4 e3 7d 18 c2 01 vinsertf128 ymm0,ymm0,xmm2,0x1
25e04d: c5 fc 58 47 20 vaddps ymm0,ymm0,YMMWORD PTR [rdi+0x20]
25e052: 48 89 e7 mov rdi,rsp
25e055: c4 e3 7d 04 d9 aa vpermilps ymm3,ymm1,0xaa
25e05b: c5 e4 58 d9 vaddps ymm3,ymm3,ymm1
25e05f: c4 e3 7d 04 c8 aa vpermilps ymm1,ymm0,0xaa
25e065: c5 f4 58 c0 vaddps ymm0,ymm1,ymm0
25e069: c4 e3 7d 04 d0 00 vpermilps ymm2,ymm0,0x0
25e06f: c4 e3 6d 06 d2 11 vperm2f128 ymm2,ymm2,ymm2,0x11
25e075: c5 ec 58 d0 vaddps ymm2,ymm2,ymm0
25e079: c4 e3 7d 04 c3 00 vpermilps ymm0,ymm3,0x0
25e07f: c4 e3 7d 06 c0 11 vperm2f128 ymm0,ymm0,ymm0,0x11
25e085: c5 fc 58 c3 vaddps ymm0,ymm0,ymm3
25e089: c4 e2 7d 18 ca vbroadcastss ymm1,xmm2
25e08e: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
25e092: c5 f4 58 ca vaddps ymm1,ymm1,ymm2
25e096: c5 fc 29 04 24 vmovaps YMMWORD PTR [rsp],ymm0
25e09b: c5 fc 29 4c 24 20 vmovaps YMMWORD PTR [rsp+0x20],ymm1
25e0a1: c5 f8 77 vzeroupper
25e0a4: e8 07 00 00 00 call 25e0b0 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj10_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
25e0a9: c5 fa 10 00 vmovss xmm0,DWORD PTR [rax]
25e0ad: c9 leave
25e0ae: c3 ret
25e0af: 90 nop

Second:

000000000025ded0 <_RNvCs1fmkINL2TB9_12test_backend13tree_reduce_1>:
25ded0: 48 83 ec 18 sub rsp,0x18
25ded4: 8b 07 mov eax,DWORD PTR [rdi]
25ded6: 48 8d 15 2b 51 4c 00 lea rdx,[rip+0x4c512b] # 723008 <global_1_test_backend_5d019e9d_cgu_0>
25dedd: 31 f6 xor esi,esi
25dedf: 48 8d 7c 24 08 lea rdi,[rsp+0x8]
25dee4: 89 44 24 08 mov DWORD PTR [rsp+0x8],eax
25dee8: e8 13 02 00 00 call 25e100 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj1_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
25deed: c5 fa 10 00 vmovss xmm0,DWORD PTR [rax]
25def1: 48 83 c4 18 add rsp,0x18
25def5: c3 ret
25def6: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]
25defd: 00 00 00
000000000025df00 <_RNvCs1fmkINL2TB9_12test_backend13tree_reduce_2>:
25df00: 48 83 ec 18 sub rsp,0x18
25df04: c5 fa 7e 0f vmovq xmm1,QWORD PTR [rdi]
25df08: 48 8d 15 f9 50 4c 00 lea rdx,[rip+0x4c50f9] # 723008 <global_1_test_backend_5d019e9d_cgu_0>
25df0f: 31 f6 xor esi,esi
25df11: 48 8d 7c 24 08 lea rdi,[rsp+0x8]
25df16: c5 f0 c6 c1 e1 vshufps xmm0,xmm1,xmm1,0xe1
25df1b: c5 f8 58 c1 vaddps xmm0,xmm0,xmm1
25df1f: c5 f8 13 44 24 08 vmovlps QWORD PTR [rsp+0x8],xmm0
25df25: e8 f6 01 00 00 call 25e120 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj2_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
25df2a: c5 fa 10 00 vmovss xmm0,DWORD PTR [rax]
25df2e: 48 83 c4 18 add rsp,0x18
25df32: c3 ret
25df33: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]
25df3a: 00 00 00
25df3d: 0f 1f 00 nop DWORD PTR [rax]
000000000025df40 <_RNvCs1fmkINL2TB9_12test_backend13tree_reduce_4>:
25df40: 48 83 ec 18 sub rsp,0x18
25df44: c4 e3 79 04 07 0e vpermilps xmm0,XMMWORD PTR [rdi],0xe
25df4a: c5 f8 58 07 vaddps xmm0,xmm0,XMMWORD PTR [rdi]
25df4e: 31 f6 xor esi,esi
25df50: 48 8d 15 b1 50 4c 00 lea rdx,[rip+0x4c50b1] # 723008 <global_1_test_backend_5d019e9d_cgu_0>
25df57: 48 89 e7 mov rdi,rsp
25df5a: c4 e3 79 04 c8 39 vpermilps xmm1,xmm0,0x39
25df60: c5 f0 58 c0 vaddps xmm0,xmm1,xmm0
25df64: c5 f8 29 04 24 vmovaps XMMWORD PTR [rsp],xmm0
25df69: e8 d2 01 00 00 call 25e140 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj4_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
25df6e: c5 fa 10 00 vmovss xmm0,DWORD PTR [rax]
25df72: 48 83 c4 18 add rsp,0x18
25df76: c3 ret
25df77: 66 0f 1f 84 00 00 00 nop WORD PTR [rax+rax*1+0x0]
25df7e: 00 00
000000000025df80 <_RNvCs1fmkINL2TB9_12test_backend13tree_reduce_8>:
25df80: 55 push rbp
25df81: 31 f6 xor esi,esi
25df83: 48 89 e5 mov rbp,rsp
25df86: 48 83 e4 e0 and rsp,0xffffffffffffffe0
25df8a: 48 83 ec 20 sub rsp,0x20
25df8e: c5 fc 28 07 vmovaps ymm0,YMMWORD PTR [rdi]
25df92: c5 fd 6f 0d 66 90 3f vmovdqa ymm1,YMMWORD PTR [rip+0x3f9066] # 657000 <_fini+0xce4>
25df99: 00
25df9a: 48 8d 15 67 50 4c 00 lea rdx,[rip+0x4c5067] # 723008 <global_1_test_backend_5d019e9d_cgu_0>
25dfa1: 48 89 e7 mov rdi,rsp
25dfa4: c4 e2 75 36 c8 vpermd ymm1,ymm1,ymm0
25dfa9: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
25dfad: c5 fd 6f 0d 6b 90 3f vmovdqa ymm1,YMMWORD PTR [rip+0x3f906b] # 657020 <_fini+0xd04>
25dfb4: 00
25dfb5: c4 e2 75 36 c8 vpermd ymm1,ymm1,ymm0
25dfba: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
25dfbe: c5 fd 6f 0d 7a 90 3f vmovdqa ymm1,YMMWORD PTR [rip+0x3f907a] # 657040 <_fini+0xd24>
25dfc5: 00
25dfc6: c4 e2 75 36 c8 vpermd ymm1,ymm1,ymm0
25dfcb: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
25dfcf: c5 fc 29 04 24 vmovaps YMMWORD PTR [rsp],ymm0
25dfd4: c5 f8 77 vzeroupper
25dfd7: e8 84 01 00 00 call 25e160 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj8_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
25dfdc: c5 fa 10 00 vmovss xmm0,DWORD PTR [rax]
25dfe0: c9 leave
25dfe1: c3 ret
25dfe2: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]
25dfe9: 00 00 00
25dfec: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
000000000025dff0 <_RNvCs1fmkINL2TB9_12test_backend14tree_reduce_16>:
25dff0: 55 push rbp
25dff1: 31 f6 xor esi,esi
25dff3: 48 89 e5 mov rbp,rsp
25dff6: 48 83 e4 c0 and rsp,0xffffffffffffffc0
25dffa: 48 83 ec 40 sub rsp,0x40
25dffe: c5 fa 10 47 38 vmovss xmm0,DWORD PTR [rdi+0x38]
25e003: c5 fa 10 4f 30 vmovss xmm1,DWORD PTR [rdi+0x30]
25e008: c4 e3 79 21 47 3c 10 vinsertps xmm0,xmm0,DWORD PTR [rdi+0x3c],0x10
25e00f: c4 e3 71 21 4f 34 10 vinsertps xmm1,xmm1,DWORD PTR [rdi+0x34],0x10
25e016: c5 fd 6f 25 82 90 3f vmovdqa ymm4,YMMWORD PTR [rip+0x3f9082] # 6570a0 <_fini+0xd84>
25e01d: 00
25e01e: 48 8d 15 e3 4f 4c 00 lea rdx,[rip+0x4c4fe3] # 723008 <global_1_test_backend_5d019e9d_cgu_0>
25e025: c5 f0 16 c8 vmovlhps xmm1,xmm1,xmm0
25e029: c5 fa 10 47 28 vmovss xmm0,DWORD PTR [rdi+0x28]
25e02e: c4 e3 79 21 57 2c 10 vinsertps xmm2,xmm0,DWORD PTR [rdi+0x2c],0x10
25e035: c5 fa 10 47 20 vmovss xmm0,DWORD PTR [rdi+0x20]
25e03a: c4 e3 79 21 47 24 10 vinsertps xmm0,xmm0,DWORD PTR [rdi+0x24],0x10
25e041: c5 f8 16 c2 vmovlhps xmm0,xmm0,xmm2
25e045: c4 e3 7d 18 c1 01 vinsertf128 ymm0,ymm0,xmm1,0x1
25e04b: c4 e2 7d 18 0f vbroadcastss ymm1,DWORD PTR [rdi]
25e050: c5 fc 58 07 vaddps ymm0,ymm0,YMMWORD PTR [rdi]
25e054: c5 f4 58 4f 20 vaddps ymm1,ymm1,YMMWORD PTR [rdi+0x20]
25e059: 48 89 e7 mov rdi,rsp
25e05c: c4 e3 7d 06 d9 21 vperm2f128 ymm3,ymm0,ymm1,0x21
25e062: c5 e4 58 d8 vaddps ymm3,ymm3,ymm0
25e066: c4 e3 75 06 c0 21 vperm2f128 ymm0,ymm1,ymm0,0x21
25e06c: c4 e2 7d 0c 05 eb 8f vpermilps ymm0,ymm0,YMMWORD PTR [rip+0x3f8feb] # 657060 <_fini+0xd44>
25e073: 3f 00
25e075: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
25e079: c5 fd 6f 0d ff 8f 3f vmovdqa ymm1,YMMWORD PTR [rip+0x3f8fff] # 657080 <_fini+0xd64>
25e080: 00
25e081: c4 e3 65 06 d0 21 vperm2f128 ymm2,ymm3,ymm0,0x21
25e087: c4 e2 75 36 c8 vpermd ymm1,ymm1,ymm0
25e08c: c5 e4 c6 d2 4e vshufps ymm2,ymm3,ymm2,0x4e
25e091: c5 ec 58 d3 vaddps ymm2,ymm2,ymm3
25e095: c4 e2 5d 36 db vpermd ymm3,ymm4,ymm3
25e09a: c4 e3 75 0c cb c0 vblendps ymm1,ymm1,ymm3,0xc0
25e0a0: c5 f4 58 c0 vaddps ymm0,ymm1,ymm0
25e0a4: c4 e3 6d 06 c8 21 vperm2f128 ymm1,ymm2,ymm0,0x21
25e0aa: c4 e3 75 0f ca 04 vpalignr ymm1,ymm1,ymm2,0x4
25e0b0: c5 ec 58 c9 vaddps ymm1,ymm2,ymm1
25e0b4: c4 e3 7d 06 d2 21 vperm2f128 ymm2,ymm0,ymm2,0x21
25e0ba: c4 e3 6d 0f d0 04 vpalignr ymm2,ymm2,ymm0,0x4
25e0c0: c5 fc 58 c2 vaddps ymm0,ymm0,ymm2
25e0c4: c5 fc 29 0c 24 vmovaps YMMWORD PTR [rsp],ymm1
25e0c9: c5 fc 29 44 24 20 vmovaps YMMWORD PTR [rsp+0x20],ymm0
25e0cf: c5 f8 77 vzeroupper
25e0d2: e8 09 00 00 00 call 25e0e0 <_RNvXNtNtCsg3zoJvvKveJ_4core9core_simd3opsINtNtB4_6vector4SimdfKj10_EINtNtNtB6_3ops5index5IndexjE5indexCs1fmkINL2TB9_12test_backend>
25e0d7: c5 fa 10 00 vmovss xmm0,DWORD PTR [rax]
25e0db: c9 leave
25e0dc: c3 ret
25e0dd: 0f 1f 00 nop DWORD PTR [rax]

To be noted:
In LLVM any function in the same cgu is inlinable unless you use
Ah, no, it was because of the CGUs: I assume rustc has ThinLTO enabled by default.

First:

000000000025dec0 <_RNvCsacpq7FklLKk_12test_backend13tree_reduce_1>:
25dec0: c5 fa 10 07 vmovss xmm0,DWORD PTR [rdi]
25dec4: c3 ret
25dec5: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]
25decc: 00 00 00
25decf: 90 nop
000000000025ded0 <_RNvCsacpq7FklLKk_12test_backend13tree_reduce_2>:
25ded0: c5 fa 7e 0f vmovq xmm1,QWORD PTR [rdi]
25ded4: c5 fa 16 c1 vmovshdup xmm0,xmm1
25ded8: c5 f8 58 c1 vaddps xmm0,xmm0,xmm1
25dedc: c3 ret
25dedd: 0f 1f 00 nop DWORD PTR [rax]
000000000025dee0 <_RNvCsacpq7FklLKk_12test_backend13tree_reduce_4>:
25dee0: c4 e3 79 04 0f f5 vpermilps xmm1,XMMWORD PTR [rdi],0xf5
25dee6: c5 f0 58 0f vaddps xmm1,xmm1,XMMWORD PTR [rdi]
25deea: c4 e3 79 04 c1 aa vpermilps xmm0,xmm1,0xaa
25def0: c5 f8 58 c1 vaddps xmm0,xmm0,xmm1
25def4: c3 ret
25def5: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]
25defc: 00 00 00
25deff: 90 nop
000000000025df00 <_RNvCsacpq7FklLKk_12test_backend13tree_reduce_8>:
25df00: c4 e3 7d 04 07 f5 vpermilps ymm0,YMMWORD PTR [rdi],0xf5
25df06: c5 fc 58 07 vaddps ymm0,ymm0,YMMWORD PTR [rdi]
25df0a: c4 e3 7d 04 c8 aa vpermilps ymm1,ymm0,0xaa
25df10: c5 f4 58 c8 vaddps ymm1,ymm1,ymm0
25df14: c4 e3 7d 04 c1 00 vpermilps ymm0,ymm1,0x0
25df1a: c4 e3 7d 06 c0 11 vperm2f128 ymm0,ymm0,ymm0,0x11
25df20: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
25df24: c5 f8 77 vzeroupper
25df27: c3 ret
25df28: 0f 1f 84 00 00 00 00 nop DWORD PTR [rax+rax*1+0x0]
25df2f: 00
000000000025df30 <_RNvCsacpq7FklLKk_12test_backend14tree_reduce_16>:
25df30: c5 fa 10 7f 1c vmovss xmm7,DWORD PTR [rdi+0x1c]
25df35: c5 fa 10 5f 14 vmovss xmm3,DWORD PTR [rdi+0x14]
25df3a: c5 fa 10 77 0c vmovss xmm6,DWORD PTR [rdi+0xc]
25df3f: c5 fa 10 4f 04 vmovss xmm1,DWORD PTR [rdi+0x4]
25df44: c5 fa 10 6f 3c vmovss xmm5,DWORD PTR [rdi+0x3c]
25df49: c5 fa 10 57 34 vmovss xmm2,DWORD PTR [rdi+0x34]
25df4e: c5 c0 14 ff vunpcklps xmm7,xmm7,xmm7
25df52: c5 e0 14 db vunpcklps xmm3,xmm3,xmm3
25df56: c5 fa 10 67 2c vmovss xmm4,DWORD PTR [rdi+0x2c]
25df5b: c5 c8 14 f6 vunpcklps xmm6,xmm6,xmm6
25df5f: c5 f0 14 c9 vunpcklps xmm1,xmm1,xmm1
25df63: c5 e0 16 df vmovlhps xmm3,xmm3,xmm7
25df67: c5 fa 10 47 24 vmovss xmm0,DWORD PTR [rdi+0x24]
25df6c: c5 f0 16 ce vmovlhps xmm1,xmm1,xmm6
25df70: c5 d0 14 ed vunpcklps xmm5,xmm5,xmm5
25df74: c5 e8 14 d2 vunpcklps xmm2,xmm2,xmm2
25df78: c5 d8 14 e4 vunpcklps xmm4,xmm4,xmm4
25df7c: c5 e8 16 d5 vmovlhps xmm2,xmm2,xmm5
25df80: c4 e3 75 18 cb 01 vinsertf128 ymm1,ymm1,xmm3,0x1
25df86: c5 f8 14 c0 vunpcklps xmm0,xmm0,xmm0
25df8a: c5 f4 58 0f vaddps ymm1,ymm1,YMMWORD PTR [rdi]
25df8e: c5 f8 16 c4 vmovlhps xmm0,xmm0,xmm4
25df92: c4 e3 7d 18 c2 01 vinsertf128 ymm0,ymm0,xmm2,0x1
25df98: c5 fc 58 47 20 vaddps ymm0,ymm0,YMMWORD PTR [rdi+0x20]
25df9d: c4 e3 7d 04 d1 aa vpermilps ymm2,ymm1,0xaa
25dfa3: c5 ec 58 d1 vaddps ymm2,ymm2,ymm1
25dfa7: c4 e3 7d 04 c8 aa vpermilps ymm1,ymm0,0xaa
25dfad: c5 f4 58 c0 vaddps ymm0,ymm1,ymm0
25dfb1: c4 e3 7d 04 c8 00 vpermilps ymm1,ymm0,0x0
25dfb7: c4 e3 75 06 c9 11 vperm2f128 ymm1,ymm1,ymm1,0x11
25dfbd: c5 f4 58 c8 vaddps ymm1,ymm1,ymm0
25dfc1: c4 e3 7d 04 c2 00 vpermilps ymm0,ymm2,0x0
25dfc7: c4 e3 7d 06 c0 11 vperm2f128 ymm0,ymm0,ymm0,0x11
25dfcd: c5 fc 58 c2 vaddps ymm0,ymm0,ymm2
25dfd1: c4 e2 7d 18 c9 vbroadcastss ymm1,xmm1
25dfd6: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
25dfda: c5 f8 77 vzeroupper
25dfdd: c3 ret

Second:

000000000025dec0 <_RNvCsacpq7FklLKk_12test_backend13tree_reduce_1>:
25dec0: c5 fa 10 07 vmovss xmm0,DWORD PTR [rdi]
25dec4: c3 ret
25dec5: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]
25decc: 00 00 00
25decf: 90 nop
000000000025ded0 <_RNvCsacpq7FklLKk_12test_backend13tree_reduce_2>:
25ded0: c5 fa 7e 0f vmovq xmm1,QWORD PTR [rdi]
25ded4: c5 f0 c6 c1 e1 vshufps xmm0,xmm1,xmm1,0xe1
25ded9: c5 f8 58 c1 vaddps xmm0,xmm0,xmm1
25dedd: c3 ret
25dede: 66 90 xchg ax,ax
000000000025dee0 <_RNvCsacpq7FklLKk_12test_backend13tree_reduce_4>:
25dee0: c4 e3 79 04 0f 0e vpermilps xmm1,XMMWORD PTR [rdi],0xe
25dee6: c5 f0 58 0f vaddps xmm1,xmm1,XMMWORD PTR [rdi]
25deea: c4 e3 79 04 c1 39 vpermilps xmm0,xmm1,0x39
25def0: c5 f8 58 c1 vaddps xmm0,xmm0,xmm1
25def4: c3 ret
25def5: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]
25defc: 00 00 00
25deff: 90 nop
000000000025df00 <_RNvCsacpq7FklLKk_12test_backend13tree_reduce_8>:
25df00: c5 fc 28 07 vmovaps ymm0,YMMWORD PTR [rdi]
25df04: c5 fd 6f 0d f4 a0 42 vmovdqa ymm1,YMMWORD PTR [rip+0x42a0f4] # 688000 <_fini+0xccc>
25df0b: 00
25df0c: c4 e2 75 36 c8 vpermd ymm1,ymm1,ymm0
25df11: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
25df15: c5 fd 6f 0d 03 a1 42 vmovdqa ymm1,YMMWORD PTR [rip+0x42a103] # 688020 <_fini+0xcec>
25df1c: 00
25df1d: c4 e2 75 36 c8 vpermd ymm1,ymm1,ymm0
25df22: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
25df26: c5 fd 6f 0d 12 a1 42 vmovdqa ymm1,YMMWORD PTR [rip+0x42a112] # 688040 <_fini+0xd0c>
25df2d: 00
25df2e: c4 e2 75 36 c8 vpermd ymm1,ymm1,ymm0
25df33: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
25df37: c5 f8 77 vzeroupper
25df3a: c3 ret
25df3b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
000000000025df40 <_RNvCsacpq7FklLKk_12test_backend14tree_reduce_16>:
25df40: c5 fa 10 47 38 vmovss xmm0,DWORD PTR [rdi+0x38]
25df45: c4 e3 79 21 4f 3c 10 vinsertps xmm1,xmm0,DWORD PTR [rdi+0x3c],0x10
25df4c: c5 fa 10 47 30 vmovss xmm0,DWORD PTR [rdi+0x30]
25df51: c4 e3 79 21 47 34 10 vinsertps xmm0,xmm0,DWORD PTR [rdi+0x34],0x10
25df58: c5 fd 6f 25 40 a1 42 vmovdqa ymm4,YMMWORD PTR [rip+0x42a140] # 6880a0 <_fini+0xd6c>
25df5f: 00
25df60: c5 f8 16 c1 vmovlhps xmm0,xmm0,xmm1
25df64: c5 fa 10 4f 28 vmovss xmm1,DWORD PTR [rdi+0x28]
25df69: c4 e3 71 21 57 2c 10 vinsertps xmm2,xmm1,DWORD PTR [rdi+0x2c],0x10
25df70: c5 fa 10 4f 20 vmovss xmm1,DWORD PTR [rdi+0x20]
25df75: c4 e3 71 21 4f 24 10 vinsertps xmm1,xmm1,DWORD PTR [rdi+0x24],0x10
25df7c: c5 f0 16 ca vmovlhps xmm1,xmm1,xmm2
25df80: c5 fd 6f 15 f8 a0 42 vmovdqa ymm2,YMMWORD PTR [rip+0x42a0f8] # 688080 <_fini+0xd4c>
25df87: 00
25df88: c4 e3 75 18 c8 01 vinsertf128 ymm1,ymm1,xmm0,0x1
25df8e: c4 e2 7d 18 07 vbroadcastss ymm0,DWORD PTR [rdi]
25df93: c5 f4 58 0f vaddps ymm1,ymm1,YMMWORD PTR [rdi]
25df97: c5 fc 58 47 20 vaddps ymm0,ymm0,YMMWORD PTR [rdi+0x20]
25df9c: c4 e3 75 06 d8 21 vperm2f128 ymm3,ymm1,ymm0,0x21
25dfa2: c5 e4 58 d9 vaddps ymm3,ymm3,ymm1
25dfa6: c4 e3 7d 06 c9 21 vperm2f128 ymm1,ymm0,ymm1,0x21
25dfac: c4 e2 75 0c 0d ab a0 vpermilps ymm1,ymm1,YMMWORD PTR [rip+0x42a0ab] # 688060 <_fini+0xd2c>
25dfb3: 42 00
25dfb5: c5 f4 58 c8 vaddps ymm1,ymm1,ymm0
25dfb9: c4 e3 65 06 c1 21 vperm2f128 ymm0,ymm3,ymm1,0x21
25dfbf: c4 e2 6d 36 d1 vpermd ymm2,ymm2,ymm1
25dfc4: c5 e4 c6 c0 4e vshufps ymm0,ymm3,ymm0,0x4e
25dfc9: c5 fc 58 c3 vaddps ymm0,ymm0,ymm3
25dfcd: c4 e2 5d 36 db vpermd ymm3,ymm4,ymm3
25dfd2: c4 e3 6d 0c d3 c0 vblendps ymm2,ymm2,ymm3,0xc0
25dfd8: c5 ec 58 c9 vaddps ymm1,ymm2,ymm1
25dfdc: c4 e3 7d 06 c9 21 vperm2f128 ymm1,ymm0,ymm1,0x21
25dfe2: c4 e3 75 0f c8 04 vpalignr ymm1,ymm1,ymm0,0x4
25dfe8: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
25dfec: c5 f8 77 vzeroupper
25dfef: c3 ret

It looks better now, but the
Right, in release mode local-thinlto is enabled by default. This is ThinLTO applied to all codegen units in a single crate, unlike regular ThinLTO which applies to all crates together.
i could be totally wrong, but can't you just call
Thank you!
Ah! Is there any summary somewhere of what GCC exposes in terms of generalized vector logic?
No, the implementation I use is there.

Edit: I just realized you mentioned

Edit 2: It seems
Reading that, I am curious, because it doesn't specify one way or another, and I would like to believe the naive possibility: does __builtin_shuffle accept a non-constant mask?

typedef int v4si __attribute__ ((vector_size (16)));
v4si shuffle1(v4si a, v4si mask) {
return __builtin_shuffle (a, mask);
}
v4si shuffle2(v4si a, v4si b, v4si mask) {
return __builtin_shuffle (a, b, mask);
}
It seems to accept non-constant shuffle masks.
I fixed the vector shuffle in rustc_codegen_gcc and I realized I was also missing the GCC cli argument.

First:

0000000000267860 <_RNvCs49eEDcGqTSg_12test_backend13tree_reduce_1>:
267860: c5 fa 10 07 vmovss xmm0,DWORD PTR [rdi]
267864: c3 ret
267865: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]
26786c: 00 00 00
26786f: 90 nop
0000000000267870 <_RNvCs49eEDcGqTSg_12test_backend13tree_reduce_2>:
267870: c5 fa 7e 0f vmovq xmm1,QWORD PTR [rdi]
267874: c5 fa 16 c1 vmovshdup xmm0,xmm1
267878: c5 f8 58 c1 vaddps xmm0,xmm0,xmm1
26787c: c3 ret
26787d: 0f 1f 00 nop DWORD PTR [rax]
0000000000267880 <_RNvCs49eEDcGqTSg_12test_backend13tree_reduce_4>:
267880: c4 e3 79 04 0f f5 vpermilps xmm1,XMMWORD PTR [rdi],0xf5
267886: c5 f0 58 0f vaddps xmm1,xmm1,XMMWORD PTR [rdi]
26788a: c4 e3 79 04 c1 aa vpermilps xmm0,xmm1,0xaa
267890: c5 f8 58 c1 vaddps xmm0,xmm0,xmm1
267894: c3 ret
267895: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]
26789c: 00 00 00
26789f: 90 nop
00000000002678a0 <_RNvCs49eEDcGqTSg_12test_backend13tree_reduce_8>:
2678a0: c4 e3 7d 04 07 f5 vpermilps ymm0,YMMWORD PTR [rdi],0xf5
2678a6: c5 fc 58 07 vaddps ymm0,ymm0,YMMWORD PTR [rdi]
2678aa: c4 e3 7d 04 c8 aa vpermilps ymm1,ymm0,0xaa
2678b0: c5 f4 58 c8 vaddps ymm1,ymm1,ymm0
2678b4: c4 e3 7d 04 c1 00 vpermilps ymm0,ymm1,0x0
2678ba: c4 e3 7d 06 c0 11 vperm2f128 ymm0,ymm0,ymm0,0x11
2678c0: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
2678c4: c5 f8 77 vzeroupper
2678c7: c3 ret
2678c8: 0f 1f 84 00 00 00 00 nop DWORD PTR [rax+rax*1+0x0]
2678cf: 00
00000000002678d0 <_RNvCs49eEDcGqTSg_12test_backend14tree_reduce_16>:
2678d0: 62 f3 7d 48 04 0f f5 vpermilps zmm1,ZMMWORD PTR [rdi],0xf5
2678d7: 62 f1 74 48 58 0f vaddps zmm1,zmm1,ZMMWORD PTR [rdi]
2678dd: b8 08 00 00 00 mov eax,0x8
2678e2: 62 f3 7d 48 04 c1 aa vpermilps zmm0,zmm1,0xaa
2678e9: 62 f1 7c 48 58 c1 vaddps zmm0,zmm0,zmm1
2678ef: 62 f1 7d 48 6f 0d 07 vmovdqa32 zmm1,ZMMWORD PTR [rip+0x3ef707] # 657000 <_fini+0xb88>
2678f6: f7 3e 00
2678f9: 62 f2 75 48 16 c8 vpermps zmm1,zmm1,zmm0
2678ff: 62 f1 74 48 58 c8 vaddps zmm1,zmm1,zmm0
267905: 62 f2 7d 48 7c c0 vpbroadcastd zmm0,eax
26790b: 62 f2 7d 48 16 c1 vpermps zmm0,zmm0,zmm1
267911: 62 f1 7c 48 58 c1 vaddps zmm0,zmm0,zmm1
267917: c5 f8 77 vzeroupper
26791a: c3 ret

Second:

0000000000267860 <_RNvCs49eEDcGqTSg_12test_backend13tree_reduce_1>:
267860: c5 fa 10 07 vmovss xmm0,DWORD PTR [rdi]
267864: c3 ret
267865: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]
26786c: 00 00 00
26786f: 90 nop
0000000000267870 <_RNvCs49eEDcGqTSg_12test_backend13tree_reduce_2>:
267870: c5 fa 7e 0f vmovq xmm1,QWORD PTR [rdi]
267874: c5 f0 c6 c1 e1 vshufps xmm0,xmm1,xmm1,0xe1
267879: c5 f8 58 c1 vaddps xmm0,xmm0,xmm1
26787d: c3 ret
26787e: 66 90 xchg ax,ax
0000000000267880 <_RNvCs49eEDcGqTSg_12test_backend13tree_reduce_4>:
267880: c4 e3 79 04 0f 0e vpermilps xmm1,XMMWORD PTR [rdi],0xe
267886: c5 f0 58 0f vaddps xmm1,xmm1,XMMWORD PTR [rdi]
26788a: c4 e3 79 04 c1 39 vpermilps xmm0,xmm1,0x39
267890: c5 f8 58 c1 vaddps xmm0,xmm0,xmm1
267894: c3 ret
267895: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]
26789c: 00 00 00
26789f: 90 nop
00000000002678a0 <_RNvCs49eEDcGqTSg_12test_backend13tree_reduce_8>:
2678a0: c5 fd 6f 0d 58 f7 3e vmovdqa ymm1,YMMWORD PTR [rip+0x3ef758] # 657000 <_fini+0xb60>
2678a7: 00
2678a8: c5 fc 28 07 vmovaps ymm0,YMMWORD PTR [rdi]
2678ac: c4 e2 75 36 c8 vpermd ymm1,ymm1,ymm0
2678b1: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
2678b5: c5 fd 6f 0d 63 f7 3e vmovdqa ymm1,YMMWORD PTR [rip+0x3ef763] # 657020 <_fini+0xb80>
2678bc: 00
2678bd: c4 e2 75 36 c8 vpermd ymm1,ymm1,ymm0
2678c2: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
2678c6: c5 fd 6f 0d 72 f7 3e vmovdqa ymm1,YMMWORD PTR [rip+0x3ef772] # 657040 <_fini+0xba0>
2678cd: 00
2678ce: c4 e2 75 36 c8 vpermd ymm1,ymm1,ymm0
2678d3: c5 fc 58 c1 vaddps ymm0,ymm0,ymm1
2678d7: c5 f8 77 vzeroupper
2678da: c3 ret
2678db: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
00000000002678e0 <_RNvCs49eEDcGqTSg_12test_backend14tree_reduce_16>:
2678e0: 62 f1 7d 48 6f 0d 96 vmovdqa32 zmm1,ZMMWORD PTR [rip+0x3ef796] # 657080 <_fini+0xbe0>
2678e7: f7 3e 00
2678ea: 62 f1 7c 48 28 07 vmovaps zmm0,ZMMWORD PTR [rdi]
2678f0: 62 f2 75 48 16 c8 vpermps zmm1,zmm1,zmm0
2678f6: 62 f1 74 48 58 c8 vaddps zmm1,zmm1,zmm0
2678fc: 62 f1 7d 48 6f 05 ba vmovdqa32 zmm0,ZMMWORD PTR [rip+0x3ef7ba] # 6570c0 <_fini+0xc20>
267903: f7 3e 00
267906: 62 f2 7d 48 16 c1 vpermps zmm0,zmm0,zmm1
26790c: 62 f1 7c 48 58 c1 vaddps zmm0,zmm0,zmm1
267912: 62 f1 7d 48 6f 0d e4 vmovdqa32 zmm1,ZMMWORD PTR [rip+0x3ef7e4] # 657100 <_fini+0xc60>
267919: f7 3e 00
26791c: 62 f2 75 48 16 c8 vpermps zmm1,zmm1,zmm0
267922: 62 f1 74 48 58 c8 vaddps zmm1,zmm1,zmm0
267928: 62 f1 7d 48 6f 05 0e vmovdqa32 zmm0,ZMMWORD PTR [rip+0x3ef80e] # 657140 <_fini+0xca0>
26792f: f8 3e 00
267932: 62 f2 7d 48 16 c1 vpermps zmm0,zmm0,zmm1
267938: 62 f1 7c 48 58 c1 vaddps zmm0,zmm0,zmm1
26793e: c5 f8 77 vzeroupper
267941: c3 ret
I just started experimenting with std::simd, and after studying code quality it became apparent that reduce's left-associative ordering is the biggest issue right now for efficiently implementing commonplace operations like dot products. I ended up writing a generic reduce_symmetric function to do the job:

use std::simd::{LaneCount, Simd, SimdElement, SupportedLaneCount};

pub fn reduce_symmetric<T, const N: usize>(a: Simd<T, N>, mut f: impl FnMut(Simd<T, N>, Simd<T, N>) -> Simd<T, N>) -> T
where
LaneCount<N>: SupportedLaneCount,
T: SimdElement,
{
assert!(N.is_power_of_two());
(0..N.ilog2()).fold(a, |a, _| {
let (lo, hi) = a.interleave(a);
f(lo, hi)
})[0]
}

I have a bunch of side-by-side studies of dot products here that might be interesting: https://godbolt.org/z/xzhnr1K83

The issue with ordered reduction becomes greatly aggravated as the vectors get wider, especially in cases when you want to use f32x32 or f32x64 to engage more parallel vector registers to overcome latency (which also lets you take advantage of AVX512 while lowering to multiple vectors for narrower-width vector ISAs like SSE, NEON and AVX2). For example, if f32x32 gets lowered to 8x f32x4 on SSE and NEON targets, then doing a reduce_sum on an f32x32 forces eight separate horizontal reductions (and each horizontal reduction is internally suboptimal) rather than only one horizontal reduction at the very end, when doing the final reduction from f32x4 to f32.

Because of this, I believe the API should be restructured around the performance principle of least surprise. As a concrete suggestion: reduce_op should use the well-defined power-of-two factorization (the interleave variant, so wider-than-native vector lowering can exploit vertical adds for reduction) and reduce_op_ordered would then offer the current left-chained semantics. In terms of the least surprise principle, the main issues I can see right now with this suggestion:
Footnote: I initially considered the deinterleave variant of symmetric reduction, which has the "advantage" of only reassociating rather than commuting terms. However, it does not play well when lowering from wider-than-native vectors, e.g. for f32x8 on an f32x4 target, where an f32x8 value is represented by a pair of f32x4 values: we want to be able to do a vertical (vector-to-vector) reduction of the pair and then a single horizontal (vector-to-scalar) reduction, whereas the deinterleave variant would perform a horizontal reduction on each of the vectors and then add the resulting scalars.

Thoughts?
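For concreteness, here is the deinterleave counterpart of reduce_symmetric that the footnote describes, as a sketch using Simd::deinterleave from a recent nightly (my own illustration, not code from the thread):

use std::simd::{LaneCount, Simd, SimdElement, SupportedLaneCount};

// Each round combines adjacent pairs (x0+x1, x2+x3, ...), i.e. the SimpleV-style
// tree order, in contrast to the halving order produced by reduce_symmetric above.
pub fn reduce_symmetric_deinterleave<T, const N: usize>(
    a: Simd<T, N>,
    mut f: impl FnMut(Simd<T, N>, Simd<T, N>) -> Simd<T, N>,
) -> T
where
    LaneCount<N>: SupportedLaneCount,
    T: SimdElement,
{
    assert!(N.is_power_of_two());
    (0..N.ilog2()).fold(a, |a, _| {
        let (evens, odds) = a.deinterleave(a);
        f(evens, odds)
    })[0]
}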
I'm okay with renaming to

I don't have a strong opinion on the actual association order, but I don't think left-to-right makes much sense other than being predictable. Maybe there are some unique cases where that's necessary.

One question we need to answer is whether the reductions need to be the same on every architecture. IEEE 754 hints that they should be numerically stable (it specifically says they should avoid underflow or overflow) but leaves it implementation defined. That leaves the decision to us, to determine what we consider "portable".

If we decide to make it consistent across architectures, I think we should still leave the exact implementation unspecified. In the future we may decide the specific algorithm performs poorly on some architecture and devise a new one. If a user really needs to know the exact algorithm, they should probably implement their own.

One option would be an API like the following (I can't remember if you can have default const generics):

enum ReduceOrder {
Stable,
Sequence,
Any,
}
fn reduce_sum<const ORDER: ReduceOrder = Stable>(self) { ... }
The IEEE-754-2008 standardese on vector reductions is indeed left purposely vague, to the point where I believe they don't say anything about precision at all. I'm not sure if taking advantage of that extreme leeway is advised as the default behavior, but at least it gives enough elbow room that any reasonable solution will meet the requirements, including what we're discussing.

I can see the argument for leaving reduction order unspecified for implementation flexibility. On the other hand, there is also a strong argument for making the "default" APIs provide well-specified behavior so you can achieve reproducibility across the major architectures for the fundamental core of the API. That's possible with IEEE-754-2008 (modulo NaN bit pattern propagation and the details around pre-rounding vs post-rounding tininess detection for FMA, but you can avoid FMA if that's an issue) and seems like a very worthwhile goal to me, even if it cannot practically be extended to every last part of the API (e.g. algebraic and transcendental functions).

Making it consistent across architectures but not across standard library versions is definitely better than leaving it totally unspecified and might be a sufficient compromise. Worst case, anyone who wants a specific ordering that is long-term stable can always ask for it explicitly by picking what function variants they call.
Ah, I remember the discussion we had about this, and I think I had misunderstood your desire to leave it underspecified as being for "let's vary it from architecture to architecture" reasons instead of "let's allow the freedom to vary it from version to version" reasons... Obviously, if we have something that has better average performance on most architectures, we should consider it, so I understand leaving that unstable for longer. If we have consensus on using a consistent algorithm from architecture to architecture, then I agree we should do that.
as mentioned previously, I proposed using a tree-reduction with power-of-2 semantics since that's reasonably efficient on most architectures and SVP64 has a specific set of instructions for that exact pattern. simplified algorithm:

pub fn tree_reduce_SVP64<T>(
mut input: Vec<T>,
mut f: impl FnMut(&T, &T) -> T,
) -> Option<T> {
let mut step = 1;
while step < input.len() {
step *= 2;
for i in (0..input.len()).step_by(step) {
let other = i + step / 2;
if other < input.len() {
let v = f(&input[i], &input[other]);
input[i] = v;
}
}
}
input.into_iter().next()
}
#[test]
fn test() {
fn run(len: usize) -> String {
let mut input = vec![];
for i in 0..len {
input.push(format!("x{i}"));
}
tree_reduce_SVP64(input, |l, r| format!("({l} + {r})")).unwrap_or_default()
}
assert_eq!(run(0), "");
assert_eq!(run(1), "x0");
assert_eq!(run(2), "(x0 + x1)");
assert_eq!(run(3), "((x0 + x1) + x2)");
assert_eq!(run(4), "((x0 + x1) + (x2 + x3))");
assert_eq!(run(5), "(((x0 + x1) + (x2 + x3)) + x4)");
assert_eq!(run(6), "(((x0 + x1) + (x2 + x3)) + (x4 + x5))");
assert_eq!(run(7), "(((x0 + x1) + (x2 + x3)) + ((x4 + x5) + x6))");
assert_eq!(run(8), "(((x0 + x1) + (x2 + x3)) + ((x4 + x5) + (x6 + x7)))");
}
Right, that tree reduction is what I referred to as the deinterleave reduction in my footnote. It can be a tiny bit better than the interleave reduction on some ISAs for native-width vector reduction (on NEON, interleave is better even for the native-width case), but in general you have an increasingly serious code quality issue when you lower from wider-than-native vector types, and IMHO that kind of lowering should be a top priority for a portable SIMD library, since you want to avoid pushing people towards target-specific cfg flags to tune vector widths if at all possible.

Here are some side-by-side comparisons of interleave vs deinterleave sum reduction for f32x4, f32x8, f32x16, f32x32 and f32x64 on SSE, AVX, AVX512 and NEON: https://godbolt.org/z/jqTKzr3hx

As you can see [1], the code generated for the deinterleave sums gets more and more problematic, to the point where reduce_sum_deinterleave_f32x64 on SSE generates a 6-iteration loop with a massive body, since the compiler has had enough. I don't have any first-hand knowledge of SVP64, but I would be a little surprised (maybe you could explain more?) if there isn't an equally or almost equally fast way to do interleaves on SVP64 as deinterleaves, since they're two of the fundamental length-parametric permutations on power-of-two vector pairs and they constantly show up in applications.

[1] Note that just looking at the code size for the lowered deinterleaves on AVX2 and AVX512 is misleading, since the generated code relies heavily on the hadd (horizontal add) instruction, which despite being one instruction is costly in terms of both latency and throughput, fairly comparable to an equivalent instruction sequence (it is microsequenced or microcoded on at least some uarches) but with less instruction scheduling freedom, which AFAIK can sometimes be a net negative.
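As a concrete illustration of the lowering argument (my own sketch, using the newer reduce_sum/from_slice names rather than the horizontal_sum-era API in this thread): when an f32x8 is carried in two f32x4 registers, the interleave/halving order needs one vertical add plus a single 4-lane horizontal reduction at the end, whereas the deinterleave/adjacent-pairs order wants a horizontal reduction per half followed by a scalar add.

#![feature(portable_simd)]
use std::simd::{f32x4, f32x8, num::SimdFloat};

// Interleave/halving order, written the way an SSE/NEON-class backend would
// like to see it once f32x8 is split into two f32x4 halves.
pub fn sum_f32x8_as_two_f32x4(v: f32x8) -> f32 {
    let a = v.to_array();
    let lo = f32x4::from_slice(&a[..4]);
    let hi = f32x4::from_slice(&a[4..]);
    (lo + hi).reduce_sum() // one vertical add, then one 4-lane horizontal sum
}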
there are fast-ish interleave/deinterleave ops, however the reduction algorithm I mentioned is built into the hardware such that any reduction is two setup instructions plus one vector instruction that runs that full reduction algorithm -- making it substantially faster than the separate vector instructions that would be needed for each stage of reduction. that works for all supported vector lengths, so for the current draft version of SVP64 that is 0-64 elements, including non-powers-of-2.

assembly to fadd-reduce the 17-element vector:

# set SVSHAPE* registers to tree-reduction mode for 17 elements
svshape 17, 1, 1, 7, 0
# enable remapping element numbers according to SVSHAPE* for next sv.* instruction
svremap 0x1F, 0, 1, 0, 0, 0, 0
# run the vector reduction, result in f4
sv.fadd *f4, *f4, *f4
This is a pretty compelling argument to me, even if it means things are slightly less efficient on some architectures.
currently llvm tends to have code quality issues when the vector length is substantially larger than native, regardless of which algorithm is being looked at (this is not specific to reductions), so reduction having bad codegen when using a vector length 16 times larger than native is imho not really a problem caused by the specific reduction algorithm.
Yeah, the f32x64 case was an investigation of the extreme case since that's currently the highest lane count available. The f32x32 and f32x16 cases are more practical due to the lower induced register pressure from register splitting. I don't have enough experience yet with LLVM in this specific area to weigh in on code quality issues with width lowering more generally, but in this specific case we're discussing there is nothing LLVM or even a human could do given how the reduction order is specified. So it does seem worth singling out. And if the code quality issues with width lowering are so severe in general, it makes it very hard to use something like std::simd for widths beyond a common denominator f32x4 if it means suffering severe regressions on native f32x4 targets. Has anyone involved with the project tried to characterize these codegen issues systematically to guide these sorts of decisions? Otherwise, as I continue kicking the tires of std::simd I'll try to gather some data.
I think this sort of issue is best resolved by having some kind of API that tells you what is the optimal vector width on the active hardware. I once asked for something like this to be added to std::simd, but it was deemed out of scope. However, the target_features crate provides such a const fn nowadays.
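A rough sketch of the compile-time flavor of that idea without any external crate (the lane-count cutoffs here are my own placeholder assumptions, not recommendations):

// cfg! is evaluated at compile time, so this reflects the features the binary
// was compiled for, not the machine it eventually runs on.
pub const PREFERRED_F32_LANES: usize = if cfg!(target_feature = "avx512f") {
    16
} else if cfg!(target_feature = "avx") {
    8
} else {
    4
};

That compile-time limitation is exactly the objection raised below for software distributed as prebuilt binaries.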
Thanks, that's the part I had missed. It might also be worth looking into what affordances and preferences SVE and RISC-V's vector extension have when it comes to reductions, since these all seem to be in the general bucket of scalable vector ISAs that hearken back more to Cray-style vectors than traditional SIMD ISAs.
Offering such an API might be helpful but it has some downsides:
Edit: I see you mentioned a const fn, so I guess you meant it would query the vector width on the machine where the program was compiled. That only works for programs that are compiled from source on the end user's own machine. At least for the kind of software I mostly work on, that's not how we distribute software to end users. And I think that once you start baking in hardware vector widths like that, it loses a lot of the appeal to me of a portable SIMD solution.
I wanted to ask here instead of opening a new issue, since I've been trying to make some dot products fast in Rust for machine learning purposes: is this issue blocked on something at the moment? This might be naive, but it seems like there are some great solutions proposed above, and maybe this is in the realm of "perfect is the enemy of good"? At least I think the machine learning use case doesn't need stability in the underlying reduction algorithm (I understand that there are users who care more). For the small point that it's worth, I mostly work on software that's distributed compiled on the author's machine, and so I agree with the above statement about a const fn vector width query not being ideal. I'm happy to run codegen on (some) variety of hardware if that's helpful.
Currently, the horizontal ops use a sequential ordering. This ordering shouldn't matter at all for integers or operations like min/max, because those should be associative, last I checked.
But it does matter for floats and their sums/products, a lot, as using a strictly sequential order not only is not very fast due to introducing data dependencies, it also introduces close to the maximum error possible. IEEE-754 states the order of the sum function on a vector may be reassociated, almost certainly precisely for this reason, and most of the hardware we are targeting supports pairwise summation or product operations. These are also called "tree reductions" for reasons that are obvious when you tilt your head and actually stare at what happens. They should be reasonable to emulate in software even in their absence.

I know @gnzlbg was intending to use tree reductions for packed_simd. We can just add the "allow reassociation" flags to the ops, and that was what the intrinsics originally did, but that makes the order totally unspecified, unfortunately. Perhaps LLVM always uses tree reductions to implement this, but they don't promise to do so. It does seem to make things go faster, though: https://llvm.godbolt.org/z/dhbdoeq4d
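A tiny self-contained demonstration (my own scalar example, not from the issue) of how much the association order matters for f32 sums: the sequential order drops every term too small to move the running total, while a pairwise/tree order keeps most of them.

fn sequential_sum(xs: &[f32]) -> f32 {
    xs.iter().copied().fold(0.0, |acc, x| acc + x)
}

// Recursive pairwise ("tree") summation: split in half, sum each half, add.
fn pairwise_sum(xs: &[f32]) -> f32 {
    match xs.len() {
        0 => 0.0,
        1 => xs[0],
        n => pairwise_sum(&xs[..n / 2]) + pairwise_sum(&xs[n / 2..]),
    }
}

fn main() {
    let mut xs = vec![1.0_f32];
    xs.extend(std::iter::repeat(1e-8_f32).take(1 << 20));
    // 1e-8 is below the precision of a running total near 1.0, so the
    // sequential sum never moves; the tree sum recovers roughly 0.0105.
    assert_eq!(sequential_sum(&xs), 1.0);
    assert!(pairwise_sum(&xs) > 1.01);
}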
We would still like to have a consistent ordering, as that would be most portable (the algorithm would produce the same results on all machines). We should explore whether it's possible to consistently make LLVM emit pairwise sums for float ops, and what the performance implications are.
This got discussed extensively in the original Rust Zulip thread.
There's a Julia forum thread on summation algorithms that probably contains relevant content.
We should adopt a resolution to commit to a horizontal_{sum,product} ordering (or lack thereof) before we stabilize, so marking this as blocking stable. Some actionable steps to take first: