mask8x8::from_bitmask falls back to scalar code #264
Comments
Also in wasm32 (+simd128), each bit is extracted individually with
I'm not sure if this is actually the reason, but it looks like Does wasm32+simd128 have a better instruction for this?
you forgot to enable avx512bw, avx512 can do this natively with just bitmasking:
apparently llvm already has a fix for at least x86 without avx512: llvm/llvm-project#53760
I don't think there's a better way for wasm32+simd128 than the portable, vectorized approach from the task description. If someone finds a creative way to do it in fewer instructions, the question is how it's lowered on the actual target platforms.
llvm/llvm-project#53760 (comment): Only
Seems the LLVM fix did not solve it for all vector types, even though the code looks rather generic. The fix was also specific to the X86 instruction selection, so wasm will probably require different changes. In the arrow-rs project we implemented these portably using packed_simd for several different types; maybe that can give you some ideas for a workaround. In the X86 world the problem was that those portable implementations were not ideal for AVX512 mask registers. The LLVM-based solution should work better for both bitmasks and vector masks.
I was thinking something is a little suspicious about this. A simpler example expands the bitmask as expected: https://rust.godbolt.org/z/9P8d5qMss. It's not so much that LLVM doesn't know how to generate this code; it just fails to in some circumstances.
I tried this code:
I expected to see this happen: vectorized mask8x8::from_bitmask, for example like this Rust code: (On x86 with an appropriate target-cpu, using PDEP may be the best approach.)
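The snippet originally posted at this point is not reproduced here. As a rough illustration only, the following sketch shows what a lane-wise expansion typically computes, written with plain arrays on stable Rust rather than the portable SIMD API. The function names are hypothetical, and the bit order (bit i maps to lane i, least significant bit first) is an assumption. Whether a compiler actually vectorizes these loops is not guaranteed.

```rust
/// Scalar reference: lane i becomes 0xff if bit i of `bits` is set, else 0x00.
/// Assumption: bit 0 (least significant) corresponds to lane 0.
pub fn expand_scalar(bits: u8) -> [u8; 8] {
    let mut out = [0u8; 8];
    for i in 0..8 {
        if (bits >> i) & 1 == 1 {
            out[i] = 0xff;
        }
    }
    out
}

/// Lane-wise formulation: broadcast the byte to every lane, AND each lane
/// with its own single-bit constant, then turn "non-zero" into 0xff.
/// Every lane performs the same operations, which is the shape a
/// vectorized implementation would have.
pub fn expand_lanewise(bits: u8) -> [u8; 8] {
    const LANE_BIT: [u8; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
    let splat = [bits; 8]; // broadcast
    let mut out = [0u8; 8];
    for i in 0..8 {
        let hit = (splat[i] & LANE_BIT[i]) != 0; // per-lane bit test
        out[i] = (hit as u8).wrapping_neg(); // true -> 0xff, false -> 0x00
    }
    out
}

/// PDEP-style formulation, emulated in portable code: deposit bit i into the
/// lowest bit of byte i (what PDEP with mask 0x0101_0101_0101_0101 would do),
/// then multiply by 0xff to fill each byte. No carries occur because every
/// byte of `spread` is 0 or 1.
pub fn expand_deposit(bits: u8) -> [u8; 8] {
    let mut spread = 0u64;
    for i in 0..8 {
        spread |= (((bits >> i) & 1) as u64) << (8 * i);
    }
    (spread * 0xff).to_le_bytes()
}
```

All three functions should agree for every input byte; the lane-wise and deposit variants differ only in whether the work happens across vector lanes or inside one 64-bit general-purpose register.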
Instead, this happened on x86: Each bit is extracted individually by movl, shrb, andb to its own general-purpose register and then inserted with vpinsrb or pinsrw (depending on target-cpu). After that, the bits are expanded to 0x00 or 0xff using vectorized code. Scalar bit extraction needs more instructions and more runtime than vectorized code. Also, it may pressure the register allocator in more complex functions.
Meta
rustc --version --verbose: