Suboptimal codegen for mask8x32 all #316

Nugine · 2022-11-18T08:45:30Z

https://rust.godbolt.org/z/b4cdaqTcP

pub fn mask8x32_all_v1(m: mask8x32) -> bool {
    m.all()
}

pub unsafe fn mask8x32_all_v2(m: mask8x32) -> bool {
    let [a, b]: [__m128i; 2] = transmute(m);
    _mm_movemask_epi8(_mm_and_si128(a, b)) as i16 == -1
}

-C opt-level=3 --edition 2021 --target x86_64-unknown-linux-gnu -C target-feature=+sse2

example::mask8x32_all_v1:
        movdqa  xmm0, xmmword ptr [rdi + 16]
        pand    xmm0, xmmword ptr [rdi]
        psllw   xmm0, 7
        pmovmskb        eax, xmm0
        cmp     ax, -1
        sete    al
        ret

example::mask8x32_all_v2:
        movdqa  xmm0, xmmword ptr [rdi + 16]
        pand    xmm0, xmmword ptr [rdi]
        pmovmskb        eax, xmm0
        cmp     eax, 65535
        sete    al
        ret

mask8x32_all_v1 generates an extra psllw instruction, which is unnecessary.

Same bug for wasm32 simd128: https://rust.godbolt.org/z/7r1fKhsM9

The text was updated successfully, but these errors were encountered:

TheIronBorn · 2022-11-18T10:10:08Z

there are a lot of missed optimizations with masks because LLVM defines them as a single bit and apparently doesn't adapt for the way x86 at least handles them

jhorstmann · 2022-11-18T11:05:45Z

This seems to be an artifact of the llvm trunc instruction

define i16 @test_trunc_and_bit_cast_i16(<16 x i8>* %pa) {
  %a = load <16 x i8>, <16 x i8>* %pa, align 32
  %b = trunc <16 x i8> %a to <16 x i1>
  %r = bitcast <16 x i1> %b to i16
  ret i16 %r
}

compiles to

test_trunc_and_bit_cast_i16: # @test_trunc_and_bit_cast_i16
  vmovdqa xmm0, xmmword ptr [rdi]
  vpsllw xmm0, xmm0, 7
  vpmovmskb eax, xmm0
  ret

So truncate means to use the low bits of the mask, the shift then moves that bit into the highes bit for use with movmask.

It seems inserting an arithmetic shift to duplicate the high bits into all positions makes llvm optimize this away.

define i16 @test_trunc_and_bit_cast_i16_shr(<16 x i8>* %pa) {
  %a = load <16 x i8>, <16 x i8>* %pa, align 32
  %b = ashr <16 x i8> %a, <i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7>
  %c = trunc <16 x i8> %b to <16 x i1>
  %r = bitcast <16 x i1> %c to i16
  ret i16 %r
}

test_trunc_and_bit_cast_i16_shr: # @test_trunc_and_bit_cast_i16_shr
  vmovdqa xmm0, xmmword ptr [rdi]
  vpmovmskb eax, xmm0
  ret

Such a shift is done for simd_bitmask but might be missing for bitwise_red

programmerjake · 2022-11-18T23:52:02Z

imho the real reason masks constantly have codegen issues is llvm has no annotation or something so it knows a vector is really a mask vector where element values are either 0 or -1 and never anything else.

calebzulawski · 2022-11-19T00:49:11Z

I wonder if that would possibly fix our various reduction issues on arm/aarch64...

This improves the codegen for vector `select`, `gather`, `scatter` and boolean reduction intrinsics and fixes rust-lang/portable-simd#316. The current behavior of mask operations during llvm codegen is to truncate the mask vector to <N x i1>, telling llvm to use the least significat bit. Since sse/avx instructions are defined to use the most significant bit, llvm has to insert a left shift before the mask can actually be used. Similarly on aarch64, mask operations like blend work bit by bit, repeating the least significant bit across the whole lane involves shifting it into the sign position and then comparing against zero. By shifting before truncating to <N x i1>, we tell llvm that we only consider the most significant bit, removing the need for additional shift instructions in the assembly.

Nugine mentioned this issue Nov 18, 2022

Tracking issue for portable-simd migration Nugine/simd#7

Open

jhorstmann linked a pull request Nov 21, 2022 that will close this issue

Consistently use the highest bit of vector masks when converting to i1 vectors rust-lang/rust#104693

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suboptimal codegen for mask8x32 all #316

Suboptimal codegen for mask8x32 all #316

Nugine commented Nov 18, 2022 •

edited

Loading

TheIronBorn commented Nov 18, 2022

jhorstmann commented Nov 18, 2022

programmerjake commented Nov 18, 2022

calebzulawski commented Nov 19, 2022

Suboptimal codegen for mask8x32 all #316

Suboptimal codegen for mask8x32 all #316

Comments

Nugine commented Nov 18, 2022 • edited Loading

TheIronBorn commented Nov 18, 2022

jhorstmann commented Nov 18, 2022

programmerjake commented Nov 18, 2022

calebzulawski commented Nov 19, 2022

Nugine commented Nov 18, 2022 •

edited

Loading