[aarch64] bitcast <N x i1> to iN produces bad assembly #59686
Comments
@llvm/issue-subscribers-backend-aarch64
The first line of the IR isn't necessary for this, actually. This code shows pretty much the same thing (2 instructions on x86-64, ~50 on aarch64):

```llvm
define i16 @cast(<16 x i1> %bits) {
  %ret = bitcast <16 x i1> %bits to i16
  ret i16 %ret
}
```
This blog post (section 'Sometimes all it takes is a MUL') explains another possible implementation. Roughly, the multiply adds shifted copies of the byte mask so that one bit from each byte accumulates in the top bytes of the product:

```llvm
define i16 @movemask(<16 x i1> %mask) {
  %bytemask = sext <16 x i1> %mask to <16 x i8>
  %mask2 = bitcast <16 x i8> %bytemask to <2 x i64>
  %m0 = extractelement <2 x i64> %mask2, i32 0
  %m1 = extractelement <2 x i64> %mask2, i32 1
  %mul0 = mul i64 %m0, u0x103070F1F3F80
  %mul1 = mul i64 %m1, u0x103070F1F3F80
  %lo64 = lshr i64 %mul0, 56
  %lo16 = trunc i64 %lo64 to i16
  %hi64 = lshr i64 %mul1, 48
  %hi16 = trunc i64 %hi64 to i16
  %hi16.masked = and i16 %hi16, u0xFF00
  %ret = add i16 %lo16, %hi16.masked
  ret i16 %ret
}
```

Or like this, with the code taken from the 'Removing the shift' section:

```llvm
define i16 @movemask(<16 x i1> %mask) {
  %bytemask = sext <16 x i1> %mask to <16 x i8>
  %mask2 = bitcast <16 x i8> %bytemask to <2 x i64>
  %m0 = extractelement <2 x i64> %mask2, i32 0
  %m0.ext = zext i64 %m0 to i128
  %m1 = extractelement <2 x i64> %mask2, i32 1
  %m1.ext = zext i64 %m1 to i128
  %mul0 = mul i128 %m0.ext, u0x103070F1F3F8000
  %mul1 = mul i128 %m1.ext, u0x103070F1F3F8000
  %lo128 = lshr i128 %mul0, 64
  %lo16 = trunc i128 %lo128 to i16
  %hi128 = lshr i128 %mul1, 64
  %hi16 = trunc i128 %hi128 to i16
  %hi16.shifted = shl i16 %hi16, 8
  %ret = add i16 %lo16, %hi16.shifted
  ret i16 %ret
}
```

These implementations also work on ARMv8 and lower, not just on aarch64. According to the blog post, they're also a bit faster.
Essentially, trying to recreate the Intel `movemask` intrinsics on aarch64 produces extremely long assembly. Rust's nightly `simd::Mask::to_bitmask()` suffers from this. Given the following IR:
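A minimal sketch of such IR, assuming a sign-bit compare feeding the bitcast (the exact original listing may have differed; the first comment notes the first instruction isn't even needed to reproduce the problem):

```llvm
; sketch only: a sign-bit test producing the <16 x i1> mask, then the bitcast
define i16 @movemask(<16 x i8> %a) {
  %mask = icmp slt <16 x i8> %a, zeroinitializer
  %ret = bitcast <16 x i1> %mask to i16
  ret i16 %ret
}
```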
On x86-64, it compiles down to just one instruction (a `pmovmskb`), as expected.
On aarch64, however, it takes a whopping 50 instructions to do the same operation.
aarch64 doesn't have a `movemask` instruction like x86-64 does, but it's possible to simulate its behavior using far fewer instructions, e.g. along the lines of https://stackoverflow.com/a/58381188 (sketched below).
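A minimal IR sketch of the common NEON technique, reconstructed here under the assumption of a mask-and-horizontal-add approach (the linked answer uses intrinsics and may differ in details): keep one distinct power-of-two bit per lane, then horizontally add each 8-byte half, which lowers to AND plus ADDV on aarch64.

```llvm
declare i8 @llvm.vector.reduce.add.v8i8(<8 x i8>)

; Sketch (assumed): each sign-extended mask byte is 0x00 or 0xFF, so ANDing with
; per-lane weights 1,2,4,...,128 keeps one distinct bit per lane; summing the
; 8 lanes of each half then yields one bitmask byte (max sum 255, no overflow).
define i16 @movemask_neon(<16 x i1> %mask) {
  %bytes = sext <16 x i1> %mask to <16 x i8>
  %bits = and <16 x i8> %bytes, <i8 1, i8 2, i8 4, i8 8, i8 16, i8 32, i8 64, i8 -128, i8 1, i8 2, i8 4, i8 8, i8 16, i8 32, i8 64, i8 -128>
  %lo = shufflevector <16 x i8> %bits, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %hi = shufflevector <16 x i8> %bits, <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %losum = call i8 @llvm.vector.reduce.add.v8i8(<8 x i8> %lo)
  %hisum = call i8 @llvm.vector.reduce.add.v8i8(<8 x i8> %hi)
  %lo16 = zext i8 %losum to i16
  %hi16 = zext i8 %hisum to i16
  %hi.shifted = shl i16 %hi16, 8
  %ret = or i16 %lo16, %hi.shifted
  ret i16 %ret
}
```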
I'm not at all familiar with codegen, but I would hope that it's possible to use some clever algorithm to create assembly that's closer to optimal for all vector lengths.
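The same bitcast pattern exists at every width covered by the title's `<N x i1>` to `iN`; a minimal example for the 8-lane case (not from the original report):

```llvm
define i8 @cast8(<8 x i1> %bits) {
  %ret = bitcast <8 x i1> %bits to i8
  ret i8 %ret
}
```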