mask8x8::from_bitmask falls back to scalar code #264

Open
lukaslihotzki opened this issue Mar 13, 2022 · 7 comments
Labels
C-bug Category: Bug I-scalarize Impact: code that should be vectorized, isn't

Comments

@lukaslihotzki commented Mar 13, 2022

I tried this code:

#![feature(portable_simd)]
use std::simd::{mask8x8, u8x8};

pub fn func(a: u8, b: u64) -> u8 {
    let c = mask8x8::from_bitmask(a).to_int().cast(); // mask lanes become 0x00 / 0xff bytes
    let d = c | u8x8::from_array(b.to_le_bytes());
    d.horizontal_and()
}

I expected to see this happen: a vectorized mask8x8::from_bitmask, for example equivalent to this Rust code:

u8x8::splat(0).lanes_ne(u8x8::splat(a) & u8x8::from_array([1, 2, 4, 8, 16, 32, 64, 128]))

(On x86 with an appropriate target-cpu, using PDEP may be the best approach.)
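
For reference, here is that expected expansion as a self-contained function (a sketch against the nightly std::simd API used in this issue; expand_bitmask is just an illustrative name, not an existing API):

#![feature(portable_simd)]
use std::simd::{mask8x8, u8x8};

// Expand an 8-bit bitmask into a mask8x8 with one splat, one AND and one
// lane-wise compare, instead of per-bit scalar extraction and inserts.
pub fn expand_bitmask(a: u8) -> mask8x8 {
    let bit = u8x8::from_array([1, 2, 4, 8, 16, 32, 64, 128]);
    // Lane i is set iff bit i of `a` is set.
    u8x8::splat(0).lanes_ne(u8x8::splat(a) & bit)
}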

Instead, this happened on x86: each bit is extracted individually with movl, shrb, and andb into its own general-purpose register and then inserted into the vector with vpinsrb or pinsrw (depending on target-cpu). Only after that are the bits expanded to 0x00 or 0xff using vectorized code. The scalar bit extraction needs more instructions and more runtime than the vectorized approach, and it can also increase register pressure in more complex functions.

Meta

rustc --version --verbose:

rustc 1.61.0-nightly (f103b2969 2022-03-12)
binary: rustc
commit-hash: f103b2969b0088953873dc1ac92eb3387c753596
commit-date: 2022-03-12
host: x86_64-unknown-linux-gnu
release: 1.61.0-nightly
LLVM version: 14.0.0
@lukaslihotzki added the C-bug (Category: Bug) label Mar 13, 2022
@lukaslihotzki (Author)

Also in wasm32 (+simd128), each bit is extracted individually with i32.shr_u, i32.and, and i16x8.replace_lane.

@lukaslihotzki changed the title from "mask8x8::from_bitmask falls back to scalar code on x86" to "mask8x8::from_bitmask falls back to scalar code" Mar 13, 2022
@calebzulawski (Member)

I'm not sure if this is actually the reason, but it looks like pdep has terrible performance on Zen 1 and 2, so this may be a compromise for those CPUs. AVX-512, for example, uses vpternlogd: https://rust.godbolt.org/z/oaEz9hvsY, so LLVM is optimizing it at least sometimes.

Does wasm32+simd128 have a better instruction for this?

@programmerjake (Member)

> I'm not sure if this is actually the reason, but it looks like pdep has terrible performance on Zen 1 and 2, so this may be a compromise for those CPUs. AVX-512, for example, uses vpternlogd: https://rust.godbolt.org/z/oaEz9hvsY, so LLVM is optimizing it at least sometimes.

You forgot to enable avx512bw; AVX-512 can do this natively with just bitmasking:
https://rust.godbolt.org/z/o1qbWvvqM
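
For example, roughly (a sketch, not the exact godbolt input; on this nightly the AVX-512 target features still need the avx512_target_feature gate, and a #[target_feature] function has to be unsafe):

#![feature(portable_simd, avx512_target_feature)]
use std::simd::mask8x8;

// With avx512bw (plus avx512vl for this 64-bit vector width) enabled,
// LLVM should be able to go through the k mask registers here instead
// of extracting and re-inserting each bit with scalar code.
#[target_feature(enable = "avx512bw,avx512vl")]
pub unsafe fn from_bitmask_avx512(a: u8) -> mask8x8 {
    mask8x8::from_bitmask(a)
}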

@programmerjake (Member)

Apparently LLVM already has a fix for at least x86 without AVX-512: llvm/llvm-project#53760

@lukaslihotzki (Author)

> Does wasm32+simd128 have a better instruction for this?

I don't think there's a better way for wasm32+simd128 than the portable, vectorized approach from the issue description. If someone finds a creative way to do it in fewer instructions, the question is how it is lowered on the actual target platforms.

> Apparently LLVM already has a fix for at least x86 without AVX-512: llvm/llvm-project#53760

According to llvm/llvm-project#53760 (comment), only mask32x8::from_bitmask should be fixed; mask8x8::from_bitmask and mask16x8::from_bitmask probably still produce scalar code. Replacing --mcpu=haswell with --march=wasm32 --mattr=simd128 in the godbolt example (done in https://godbolt.org/z/rfoWdEezj) shows that wasm32 is still affected in llc (trunk).

@jhorstmann

It seems the LLVM fix did not solve it for all vector types, even though the code looks rather generic. The fix was also specific to x86 instruction selection; wasm will probably require different changes.

In the arrow-rs project we implemented these conversions portably using packed_simd for several different types; maybe that can give you some ideas for a workaround. On the x86 side, the problem was that those portable implementations were not ideal for AVX-512 mask registers. The LLVM-based solution should work better for both bitmasks and vector masks.
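
For a rough idea of the shape of such a portable workaround, here is a std::simd sketch (illustrative only, not the actual arrow-rs/packed_simd code) that expands a 64-bit bitmask into a 64-lane byte mask:

#![feature(portable_simd)]
use std::simd::{mask8x64, u8x64};

// Illustrative workaround: spread each byte of the bitmask across its
// eight lanes, then test each lane's bit, so no per-bit scalar
// extraction is needed regardless of how from_bitmask is lowered.
pub fn expand_bitmask64(bits: u64) -> mask8x64 {
    let bytes = bits.to_le_bytes();
    let mut spread = [0u8; 64];
    let mut lane_bit = [0u8; 64];
    for i in 0..64 {
        spread[i] = bytes[i / 8];   // lane i sees byte i / 8 of the mask
        lane_bit[i] = 1 << (i % 8); // and tests bit i % 8 inside it
    }
    (u8x64::from_array(spread) & u8x64::from_array(lane_bit))
        .lanes_ne(u8x64::splat(0))
}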

@calebzulawski (Member)

I was thinking something seemed a little suspicious about this. A simpler example expands the bitmask as expected: https://rust.godbolt.org/z/9P8d5qMss. So it's not that LLVM doesn't know how to generate this code; it just fails to do so in some circumstances.
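
A minimal standalone conversion in the spirit of that comparison (the exact input behind the godbolt link isn't reproduced here) would look something like this, with no further vector ops after the expansion:

#![feature(portable_simd)]
use std::simd::{i8x8, mask8x8};

// Just the bitmask expansion on its own.
pub fn expand_only(a: u8) -> i8x8 {
    mask8x8::from_bitmask(a).to_int()
}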
