mask8x8::from_bitmask falls back to scalar code #264

Open
lukaslihotzki opened this issue Mar 13, 2022 · 7 comments
Labels
C-bug Category: Bug I-scalarize Impact: code that should be vectorized, isn't

Comments

@lukaslihotzki commented Mar 13, 2022

I tried this code:

#![feature(portable_simd)]
use std::simd::{mask8x8, u8x8};

pub fn func(a: u8, b: u64) -> u8 {
    let c = mask8x8::from_bitmask(a).to_int().cast(); // mask lanes become 0x00 / 0xff bytes
    let d = c | u8x8::from_array(b.to_le_bytes());
    d.horizontal_and()
}

I expected to see this happen: a vectorized mask8x8::from_bitmask, for example equivalent to this Rust code:

u8x8::splat(0).lanes_ne(u8x8::splat(a) & u8x8::from_array([1, 2, 4, 8, 16, 32, 64, 128]))

(On x86 with an appropriate target-cpu, using PDEP may be the best approach.)
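
For reference, here is that expected expansion as a self-contained function (a sketch against the nightly std::simd API used in this issue; expand_bitmask is just an illustrative name, not an existing API):

#![feature(portable_simd)]
use std::simd::{mask8x8, u8x8};

// Expand an 8-bit bitmask into a mask8x8 with one splat, one AND and one
// lane-wise compare, instead of per-bit scalar extraction and inserts.
pub fn expand_bitmask(a: u8) -> mask8x8 {
    let bit = u8x8::from_array([1, 2, 4, 8, 16, 32, 64, 128]);
    // Lane i is set iff bit i of `a` is set.
    u8x8::splat(0).lanes_ne(u8x8::splat(a) & bit)
}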

Instead, this happened on x86: each bit is extracted individually with movl, shrb, and andb into its own general-purpose register and then inserted into the vector with vpinsrb or pinsrw (depending on target-cpu). Only after that are the bits expanded to 0x00 or 0xff using vectorized code. The scalar bit extraction needs more instructions and more runtime than the vectorized approach, and it can also increase register pressure in more complex functions.

Meta

rustc --version --verbose:

rustc 1.61.0-nightly (f103b2969 2022-03-12)
binary: rustc
commit-hash: f103b2969b0088953873dc1ac92eb3387c753596
commit-date: 2022-03-12
host: x86_64-unknown-linux-gnu
release: 1.61.0-nightly
LLVM version: 14.0.0
@lukaslihotzki added the C-bug (Category: Bug) label Mar 13, 2022
@lukaslihotzki (Author)

Also in wasm32 (+simd128), each bit is extracted individually with i32.shr_u, i32.and, and i16x8.replace_lane.

@lukaslihotzki changed the title from "mask8x8::from_bitmask falls back to scalar code on x86" to "mask8x8::from_bitmask falls back to scalar code" Mar 13, 2022
@calebzulawski (Member)

I'm not sure if this is actually the reason, but it looks like pdep has terrible performance on Zen 1 and 2, so this may be a compromise for those CPUs. AVX-512, for example, uses vpternlogd: https://rust.godbolt.org/z/oaEz9hvsY, so LLVM is optimizing it at least sometimes.

Does wasm32+simd128 have a better instruction for this?

@programmerjake (Member)

> I'm not sure if this is actually the reason, but it looks like pdep has terrible performance on Zen 1 and 2, so this may be a compromise for those CPUs. AVX-512, for example, uses vpternlogd: https://rust.godbolt.org/z/oaEz9hvsY, so LLVM is optimizing it at least sometimes.

You forgot to enable avx512bw; AVX-512 can do this natively with just bitmasking:
https://rust.godbolt.org/z/o1qbWvvqM
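
For example, roughly (a sketch, not the exact godbolt input; on this nightly the AVX-512 target features still need the avx512_target_feature gate, and a #[target_feature] function has to be unsafe):

#![feature(portable_simd, avx512_target_feature)]
use std::simd::mask8x8;

// With avx512bw (plus avx512vl for this 64-bit vector width) enabled,
// LLVM should be able to go through the k mask registers here instead
// of extracting and re-inserting each bit with scalar code.
#[target_feature(enable = "avx512bw,avx512vl")]
pub unsafe fn from_bitmask_avx512(a: u8) -> mask8x8 {
    mask8x8::from_bitmask(a)
}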

@programmerjake (Member)

Apparently LLVM already has a fix for at least x86 without AVX-512: llvm/llvm-project#53760

@lukaslihotzki (Author)

> Does wasm32+simd128 have a better instruction for this?

I don't think there's a better way for wasm32+simd128 than the portable, vectorized approach from the issue description. If someone finds a creative way to do it in fewer instructions, the question is how it is lowered on the actual target platforms.

> Apparently LLVM already has a fix for at least x86 without AVX-512: llvm/llvm-project#53760

According to llvm/llvm-project#53760 (comment), only mask32x8::from_bitmask should be fixed; mask8x8::from_bitmask and mask16x8::from_bitmask probably still produce scalar code. Replacing --mcpu=haswell with --march=wasm32 --mattr=simd128 in the godbolt example (done in https://godbolt.org/z/rfoWdEezj) shows that wasm32 is still affected in llc (trunk).

@jhorstmann

It seems the LLVM fix did not solve it for all vector types, even though the code looks rather generic. The fix was also specific to x86 instruction selection; wasm will probably require different changes.

In the arrow-rs project we implemented these conversions portably using packed_simd for several different types; maybe that can give you some ideas for a workaround. On the x86 side, the problem was that those portable implementations were not ideal for AVX-512 mask registers. The LLVM-based solution should work better for both bitmasks and vector masks.
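
For a rough idea of the shape of such a portable workaround, here is a std::simd sketch (illustrative only, not the actual arrow-rs/packed_simd code) that expands a 64-bit bitmask into a 64-lane byte mask:

#![feature(portable_simd)]
use std::simd::{mask8x64, u8x64};

// Illustrative workaround: spread each byte of the bitmask across its
// eight lanes, then test each lane's bit, so no per-bit scalar
// extraction is needed regardless of how from_bitmask is lowered.
pub fn expand_bitmask64(bits: u64) -> mask8x64 {
    let bytes = bits.to_le_bytes();
    let mut spread = [0u8; 64];
    let mut lane_bit = [0u8; 64];
    for i in 0..64 {
        spread[i] = bytes[i / 8];   // lane i sees byte i / 8 of the mask
        lane_bit[i] = 1 << (i % 8); // and tests bit i % 8 inside it
    }
    (u8x64::from_array(spread) & u8x64::from_array(lane_bit))
        .lanes_ne(u8x64::splat(0))
}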

@calebzulawski (Member)

I was thinking something seemed a little suspicious about this. A simpler example expands the bitmask as expected: https://rust.godbolt.org/z/9P8d5qMss. So it's not that LLVM doesn't know how to generate this code; it just fails to do so in some circumstances.
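
A minimal standalone conversion in the spirit of that comparison (the exact input behind the godbolt link isn't reproduced here) would look something like this, with no further vector ops after the expansion:

#![feature(portable_simd)]
use std::simd::{i8x8, mask8x8};

// Just the bitmask expansion on its own.
pub fn expand_only(a: u8) -> i8x8 {
    mask8x8::from_bitmask(a).to_int()
}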
