[aarch64] bitcast <N x i1> to iN produces bad assembly #59686

Open

Sp00ph opened this issue Dec 23, 2022 · 3 comments

Sp00ph commented Dec 23, 2022

Trying to recreate Intel's movemask intrinsics on aarch64 produces extremely long assembly; Rust's nightly simd::Mask::to_bitmask() suffers from this.
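
For reference, source along these lines triggers it (a rough sketch using nightly portable SIMD; the SimdPartialOrd/ToBitMask names are from the std::simd API at the time and may have moved since):

#![feature(portable_simd)]
use std::simd::{i8x16, SimdPartialOrd, ToBitMask};

// Compare each byte against zero and pack the lane mask into an integer;
// after optimization this is essentially the IR below.
pub fn movemask(v: i8x16) -> u16 {
    v.simd_lt(i8x16::splat(0)).to_bitmask()
}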

Given the following IR:

define i16 @movemask(<16 x i8> %mask) {
    %bits = icmp slt <16 x i8> %mask, zeroinitializer
    %ret = bitcast <16 x i1> %bits to i16
    ret i16 %ret
}

On x86-64, it compiles down to a single pmovmskb, as expected:

movemask:
        pmovmskb        eax, xmm0
        ret

On aarch64, however, it takes a whopping 50 instructions to do the same operation:

movemask:
        sub     sp, sp, #16
        cmlt    v0.16b, v0.16b, #0
        umov    w8, v0.b[1]
        umov    w10, v0.b[2]
        umov    w9, v0.b[0]
        umov    w11, v0.b[3]
        umov    w12, v0.b[4]
        umov    w13, v0.b[5]
        and     w8, w8, #0x1
        and     w10, w10, #0x1
        and     w9, w9, #0x1
        and     w11, w11, #0x1
        and     w12, w12, #0x1
        and     w13, w13, #0x1
        bfi     w9, w8, #1, #1
        umov    w8, v0.b[6]
        bfi     w9, w10, #2, #1
        umov    w10, v0.b[7]
        bfi     w9, w11, #3, #1
        umov    w11, v0.b[8]
        bfi     w9, w12, #4, #1
        umov    w12, v0.b[9]
        and     w8, w8, #0x1
        bfi     w9, w13, #5, #1
        umov    w13, v0.b[10]
        and     w10, w10, #0x1
        orr     w8, w9, w8, lsl #6
        umov    w9, v0.b[11]
        and     w11, w11, #0x1
        orr     w8, w8, w10, lsl #7
        umov    w10, v0.b[12]
        and     w12, w12, #0x1
        orr     w8, w8, w11, lsl #8
        umov    w11, v0.b[13]
        and     w13, w13, #0x1
        orr     w8, w8, w12, lsl #9
        umov    w12, v0.b[14]
        and     w9, w9, #0x1
        orr     w8, w8, w13, lsl #10
        and     w10, w10, #0x1
        orr     w8, w8, w9, lsl #11
        and     w9, w11, #0x1
        umov    w11, v0.b[15]
        orr     w8, w8, w10, lsl #12
        and     w10, w12, #0x1
        orr     w8, w8, w9, lsl #13
        orr     w8, w8, w10, lsl #14
        orr     w8, w8, w11, lsl #15
        and     w0, w8, #0xffff
        add     sp, sp, #16
        ret

aarch64 doesn't have a movemask instruction like x86-64 does, but it's possible to simulate its behavior with far fewer instructions, for example:

movemask:
        ushr    v0.16b, v0.16b, #7
        usra    v0.8h, v0.8h, #7
        usra    v0.4s, v0.4s, #14
        usra    v0.2d, v0.2d, #28
        umov    w0, v0.b[0]
        umov    w8, v0.b[8]
        bfi     w0, w8, #8, #24
        ret

(compiled from https://stackoverflow.com/a/58381188)
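
For comparison, the same shift/accumulate sequence written with the core::arch::aarch64 NEON intrinsics (just a sketch mirroring the assembly above, not code from the issue; the ushr #7 already isolates each byte's sign bit, so it takes the raw byte vector):

#[cfg(target_arch = "aarch64")]
pub unsafe fn movemask(v: core::arch::aarch64::int8x16_t) -> u16 {
    use core::arch::aarch64::*;
    // ushr v.16b, #7: move each byte's sign bit down to bit 0.
    let bits = vshrq_n_u8::<7>(vreinterpretq_u8_s8(v));
    // usra with shifts 7, 14, 28: fold the per-byte bits into progressively
    // wider lanes until each 64-bit half has its 8 mask bits in the low byte.
    let h = vsraq_n_u16::<7>(vreinterpretq_u16_u8(bits), vreinterpretq_u16_u8(bits));
    let s = vsraq_n_u32::<14>(vreinterpretq_u32_u16(h), vreinterpretq_u32_u16(h));
    let d = vsraq_n_u64::<28>(vreinterpretq_u64_u32(s), vreinterpretq_u64_u32(s));
    let bytes = vreinterpretq_u8_u64(d);
    // umov/bfi: combine byte 0 (lanes 0..7) and byte 8 (lanes 8..15).
    (vgetq_lane_u8::<0>(bytes) as u16) | ((vgetq_lane_u8::<8>(bytes) as u16) << 8)
}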

I'm not at all familiar with codegen, but I would hope that it's possible to use some clever algorithm to create assembly that's closer to optimal for all vector lengths.


llvmbot commented Dec 24, 2022

@llvm/issue-subscribers-backend-aarch64


Sp00ph commented Dec 24, 2022

The first line of the IR body (the icmp) isn't actually necessary. This code shows pretty much the same behavior (2 instructions on x86-64, ~50 on aarch64):

define i16 @cast(<16 x i1> %bits) {
    %ret = bitcast <16 x i1> %bits to i16
    ret i16 %ret
}


Sp00ph commented Dec 25, 2022

This blog post (section 'Sometimes all it takes is a MUL') explains another possible implementation, which could look something like this:

define i16 @movemask(<16 x i1> %mask) {
    %bytemask = sext <16 x i1> %mask to <16 x i8>
    %mask2 = bitcast <16 x i8> %bytemask to <2 x i64>
    %m0 = extractelement <2 x i64> %mask2, i32 0
    %m1 = extractelement <2 x i64> %mask2, i32 1
    %mul0 = mul i64 %m0, u0x103070F1F3F80
    %mul1 = mul i64 %m1, u0x103070F1F3F80
    %lo64 = lshr i64 %mul0, 56
    %lo16 = trunc i64 %lo64 to i16
    %hi64 = lshr i64 %mul1, 48
    %hi16 = trunc i64 %hi64 to i16
    %hi16.masked = and i16 %hi16, u0xFF00
    %ret = add i16 %lo16, %hi16.masked
    ret i16 %ret
}
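
To see why the constant works: after the sext every byte is 0x00 or 0xFF, and multiplying by u0x103070F1F3F80 places one copy of each lane's bit at a distinct bit position, with lane i landing at bit 56 + i; since no two set bits of the partial products overlap, no carries can disturb bits 56..63. A scalar sketch of one 8-lane half (illustration only, not code from the blog post):

// sext_bytes is one 64-bit half of the sign-extended mask
// (byte i is 0xFF when lane i is set).
fn movemask_half(sext_bytes: u64) -> u8 {
    const MAGIC: u64 = 0x0001_0307_0F1F_3F80;
    // The i64 multiply drops everything above bit 63, and bits 56..63
    // of the product are exactly lane bits 0..7.
    (sext_bytes.wrapping_mul(MAGIC) >> 56) as u8
}

fn main() {
    // Lanes 0, 2 and 7 set -> expect 0b1000_0101.
    assert_eq!(movemask_half(0xFF00_0000_00FF_00FF), 0b1000_0101);
}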

Or like this with the code taken from the 'Removing the shift' section:

define i16 @movemask(<16 x i1> %mask) {
    %bytemask = sext <16 x i1> %mask to <16 x i8>
    %mask2 = bitcast <16 x i8> %bytemask to <2 x i64>
    %m0 = extractelement <2 x i64> %mask2, i32 0
    %m0.ext = zext i64 %m0 to i128
    %m1 = extractelement <2 x i64> %mask2, i32 1
    %m1.ext = zext i64 %m1 to i128
    %mul0 = mul i128 %m0.ext, u0x103070F1F3F8000
    %mul1 = mul i128 %m1.ext, u0x103070F1F3F8000
    %lo128 = lshr i128 %mul0, 64
    %lo16 = trunc i128 %lo128 to i16
    %lo16.masked = and i16 %lo16, u0xFF
    %hi128 = lshr i128 %mul1, 64
    %hi16 = trunc i128 %hi128 to i16
    %hi16.shifted = shl i16 %hi16, 8
    %ret = add i16 %lo16.masked, %hi16.shifted
    ret i16 %ret
}
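
The widening variant shifts the constant left by 8 so the 8 result bits land at bits 64..71, i.e. in the low byte of the high half of a 64x64 -> 128 multiply, removing the explicit shift. Note that bits 72..78 of the 128-bit product can also be set, which is why the low half has to be masked to 8 bits before the two halves are combined (hence the and above). A scalar sketch of one half (illustration only):

fn movemask_half_widening(sext_bytes: u64) -> u8 {
    const MAGIC: u128 = 0x0103_070F_1F3F_8000;
    // Take the low byte of the high 64 bits of the widening multiply.
    ((sext_bytes as u128 * MAGIC) >> 64) as u8
}

fn main() {
    // Same example: lanes 0, 2 and 7 set -> 0b1000_0101.
    assert_eq!(movemask_half_widening(0xFF00_0000_00FF_00FF), 0b1000_0101);
}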

These implementations also work on 32-bit ARM (ARMv8 and earlier), not just on aarch64. According to the blog post, they're also a bit faster.
