[aarch64] bitcast <N x i1> to iN produces bad assembly #59686

Open

Sp00ph opened this issue Dec 23, 2022 · 3 comments

Sp00ph commented Dec 23, 2022

Trying to recreate Intel's movemask intrinsics on aarch64 produces extremely long assembly; Rust's nightly simd::Mask::to_bitmask() suffers from this.
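
For reference, source along these lines triggers it (a rough sketch using nightly portable SIMD; the SimdPartialOrd/ToBitMask names are from the std::simd API at the time and may have moved since):

#![feature(portable_simd)]
use std::simd::{i8x16, SimdPartialOrd, ToBitMask};

// Compare each byte against zero and pack the lane mask into an integer;
// after optimization this is essentially the IR below.
pub fn movemask(v: i8x16) -> u16 {
    v.simd_lt(i8x16::splat(0)).to_bitmask()
}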

Given the following IR:

define i16 @movemask(<16 x i8> %mask) {
    %bits = icmp slt <16 x i8> %mask, zeroinitializer
    %ret = bitcast <16 x i1> %bits to i16
    ret i16 %ret
}

On x86-64, it compiles down to a single pmovmskb, as expected:

movemask:
        pmovmskb        eax, xmm0
        ret

On aarch64, however, it takes a whopping 50 instructions to do the same operation:

movemask:
        sub     sp, sp, #16
        cmlt    v0.16b, v0.16b, #0
        umov    w8, v0.b[1]
        umov    w10, v0.b[2]
        umov    w9, v0.b[0]
        umov    w11, v0.b[3]
        umov    w12, v0.b[4]
        umov    w13, v0.b[5]
        and     w8, w8, #0x1
        and     w10, w10, #0x1
        and     w9, w9, #0x1
        and     w11, w11, #0x1
        and     w12, w12, #0x1
        and     w13, w13, #0x1
        bfi     w9, w8, #1, #1
        umov    w8, v0.b[6]
        bfi     w9, w10, #2, #1
        umov    w10, v0.b[7]
        bfi     w9, w11, #3, #1
        umov    w11, v0.b[8]
        bfi     w9, w12, #4, #1
        umov    w12, v0.b[9]
        and     w8, w8, #0x1
        bfi     w9, w13, #5, #1
        umov    w13, v0.b[10]
        and     w10, w10, #0x1
        orr     w8, w9, w8, lsl #6
        umov    w9, v0.b[11]
        and     w11, w11, #0x1
        orr     w8, w8, w10, lsl #7
        umov    w10, v0.b[12]
        and     w12, w12, #0x1
        orr     w8, w8, w11, lsl #8
        umov    w11, v0.b[13]
        and     w13, w13, #0x1
        orr     w8, w8, w12, lsl #9
        umov    w12, v0.b[14]
        and     w9, w9, #0x1
        orr     w8, w8, w13, lsl #10
        and     w10, w10, #0x1
        orr     w8, w8, w9, lsl #11
        and     w9, w11, #0x1
        umov    w11, v0.b[15]
        orr     w8, w8, w10, lsl #12
        and     w10, w12, #0x1
        orr     w8, w8, w9, lsl #13
        orr     w8, w8, w10, lsl #14
        orr     w8, w8, w11, lsl #15
        and     w0, w8, #0xffff
        add     sp, sp, #16
        ret

aarch64 doesn't have a movemask instruction like x86-64 does, but it's possible to simulate its behavior with far fewer instructions, for example:

movemask:
        ushr    v0.16b, v0.16b, #7
        usra    v0.8h, v0.8h, #7
        usra    v0.4s, v0.4s, #14
        usra    v0.2d, v0.2d, #28
        umov    w0, v0.b[0]
        umov    w8, v0.b[8]
        bfi     w0, w8, #8, #24
        ret

(compiled from https://stackoverflow.com/a/58381188)
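
For comparison, the same shift/accumulate sequence written with the core::arch::aarch64 NEON intrinsics (just a sketch mirroring the assembly above, not code from the issue; the ushr #7 already isolates each byte's sign bit, so it takes the raw byte vector):

#[cfg(target_arch = "aarch64")]
pub unsafe fn movemask(v: core::arch::aarch64::int8x16_t) -> u16 {
    use core::arch::aarch64::*;
    // ushr v.16b, #7: move each byte's sign bit down to bit 0.
    let bits = vshrq_n_u8::<7>(vreinterpretq_u8_s8(v));
    // usra with shifts 7, 14, 28: fold the per-byte bits into progressively
    // wider lanes until each 64-bit half has its 8 mask bits in the low byte.
    let h = vsraq_n_u16::<7>(vreinterpretq_u16_u8(bits), vreinterpretq_u16_u8(bits));
    let s = vsraq_n_u32::<14>(vreinterpretq_u32_u16(h), vreinterpretq_u32_u16(h));
    let d = vsraq_n_u64::<28>(vreinterpretq_u64_u32(s), vreinterpretq_u64_u32(s));
    let bytes = vreinterpretq_u8_u64(d);
    // umov/bfi: combine byte 0 (lanes 0..7) and byte 8 (lanes 8..15).
    (vgetq_lane_u8::<0>(bytes) as u16) | ((vgetq_lane_u8::<8>(bytes) as u16) << 8)
}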

I'm not at all familiar with codegen, but I would hope that it's possible to use some clever algorithm to create assembly that's closer to optimal for all vector lengths.


llvmbot commented Dec 24, 2022

@llvm/issue-subscribers-backend-aarch64


Sp00ph commented Dec 24, 2022

The first line of the IR body (the icmp) isn't actually necessary. This code shows pretty much the same behavior (2 instructions on x86-64, ~50 on aarch64):

define i16 @cast(<16 x i1> %bits) {
    %ret = bitcast <16 x i1> %bits to i16
    ret i16 %ret
}


Sp00ph commented Dec 25, 2022

This blog post (section 'Sometimes all it takes is a MUL') explains another possible implementation, which could look something like this:

define i16 @movemask(<16 x i1> %mask) {
    %bytemask = sext <16 x i1> %mask to <16 x i8>
    %mask2 = bitcast <16 x i8> %bytemask to <2 x i64>
    %m0 = extractelement <2 x i64> %mask2, i32 0
    %m1 = extractelement <2 x i64> %mask2, i32 1
    %mul0 = mul i64 %m0, u0x103070F1F3F80
    %mul1 = mul i64 %m1, u0x103070F1F3F80
    %lo64 = lshr i64 %mul0, 56
    %lo16 = trunc i64 %lo64 to i16
    %hi64 = lshr i64 %mul1, 48
    %hi16 = trunc i64 %hi64 to i16
    %hi16.masked = and i16 %hi16, u0xFF00
    %ret = add i16 %lo16, %hi16.masked
    ret i16 %ret
}
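
To see why the constant works: after the sext every byte is 0x00 or 0xFF, and multiplying by u0x103070F1F3F80 places one copy of each lane's bit at a distinct bit position, with lane i landing at bit 56 + i; since no two set bits of the partial products overlap, no carries can disturb bits 56..63. A scalar sketch of one 8-lane half (illustration only, not code from the blog post):

// sext_bytes is one 64-bit half of the sign-extended mask
// (byte i is 0xFF when lane i is set).
fn movemask_half(sext_bytes: u64) -> u8 {
    const MAGIC: u64 = 0x0001_0307_0F1F_3F80;
    // The i64 multiply drops everything above bit 63, and bits 56..63
    // of the product are exactly lane bits 0..7.
    (sext_bytes.wrapping_mul(MAGIC) >> 56) as u8
}

fn main() {
    // Lanes 0, 2 and 7 set -> expect 0b1000_0101.
    assert_eq!(movemask_half(0xFF00_0000_00FF_00FF), 0b1000_0101);
}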

Or like this with the code taken from the 'Removing the shift' section:

define i16 @movemask(<16 x i1> %mask) {
    %bytemask = sext <16 x i1> %mask to <16 x i8>
    %mask2 = bitcast <16 x i8> %bytemask to <2 x i64>
    %m0 = extractelement <2 x i64> %mask2, i32 0
    %m0.ext = zext i64 %m0 to i128
    %m1 = extractelement <2 x i64> %mask2, i32 1
    %m1.ext = zext i64 %m1 to i128
    %mul0 = mul i128 %m0.ext, u0x103070F1F3F8000
    %mul1 = mul i128 %m1.ext, u0x103070F1F3F8000
    %lo128 = lshr i128 %mul0, 64
    %lo16 = trunc i128 %lo128 to i16
    %lo16.masked = and i16 %lo16, u0xFF
    %hi128 = lshr i128 %mul1, 64
    %hi16 = trunc i128 %hi128 to i16
    %hi16.shifted = shl i16 %hi16, 8
    %ret = add i16 %lo16.masked, %hi16.shifted
    ret i16 %ret
}
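
The widening variant shifts the constant left by 8 so the 8 result bits land at bits 64..71, i.e. in the low byte of the high half of a 64x64 -> 128 multiply, removing the explicit shift. Note that bits 72..78 of the 128-bit product can also be set, which is why the low half has to be masked to 8 bits before the two halves are combined (hence the and above). A scalar sketch of one half (illustration only):

fn movemask_half_widening(sext_bytes: u64) -> u8 {
    const MAGIC: u128 = 0x0103_070F_1F3F_8000;
    // Take the low byte of the high 64 bits of the widening multiply.
    ((sext_bytes as u128 * MAGIC) >> 64) as u8
}

fn main() {
    // Same example: lanes 0, 2 and 7 set -> 0b1000_0101.
    assert_eq!(movemask_half_widening(0xFF00_0000_00FF_00FF), 0b1000_0101);
}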

These implementations also work on 32-bit ARM (ARMv8 and earlier), not just on aarch64. According to the blog post, they're also a bit faster.
