Improve Vector128.ExtractMostSignificantBits for arm64 #76047

Open
Tracked by #94464
EgorBo opened this issue Sep 23, 2022 · 3 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@EgorBo
Member

EgorBo commented Sep 23, 2022

Per discussion with @tannergooding on Discord

Vector128.ExtractMostSignificantBits is quite an important API that is typically used together with comparisons and TrailingZeroCount/LeadingZeroCount to find the position of an element in a vector, e.g. in various IndexOf-like algorithms. Example:

static void PrintPosition()
{
    Vector128<byte> src = Vector128.Create((byte)0, 0, 0, 42, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    Vector128<byte> val = Vector128.Create((byte)42);

    // prints 3 because 42 is at index 3 in the src vector
    Console.WriteLine(FirstMatch(src, val)); 
}

static int FirstMatch(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(src, val);
    return BitOperations.TrailingZeroCount(eq.ExtractMostSignificantBits());
}

Codegen for FirstMatch on x64:

       vpcmpeqb  xmm0, xmm0, xmm1
       vpmovmskb eax, xmm0
       tzcnt     eax, eax

Codegen for FirstMatch on arm64:

            cmeq    v16.16b, v0.16b, v1.16b
            ldr     q17, [@RWD00]
            and     v16.16b, v16.16b, v17.16b
            ldr     q17, [@RWD16]
            ushl    v16.16b, v16.16b, v17.16b
            movi    v17.4s, #0
            ext     v17.16b, v16.16b, v17.16b, #8
            addv    b17, v17.8b
            umov    w0, v17.b[0]
            lsl     w0, w0, #8
            addv    b16, v16.8b
            umov    w1, v16.b[0]
            orr     w0, w0, w1
            rbit    w0, w0
            clz     w0, w0

This is because arm64 doesn't have a direct equivalent of movmsk. However, this particular case can be optimized: we know the input of ExtractMostSignificantBits is a comparison result where every element is either all-zeros or all-bits-set, so in the best case we can use the smart shrn trick from the Arm blog post by @danlark1 (shifting each 16-bit lane right by 4 and narrowing collapses the comparison result into a 64-bit mask with 4 bits per element):

            cmeq    v16.16b, v0.16b, v1.16b
            shrn    v16.8b, v16.8h, #4
            umov    x0, v16.d[0]
            rbit    x0, x0
            clz     x0, x0
            asr     w0, w0, #2

Its C# equivalent:

static int FirstMatch(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(src, val);
    // Each byte of 'eq' contributes 4 bits to the 64-bit mask, in element order.
    ulong matches = AdvSimd.ShiftRightLogicalNarrowingLower(eq.AsUInt16(), 4).AsUInt64().ToScalar();
    // TrailingZeroCount therefore returns 4 * (index of the first match), hence the >> 2.
    return BitOperations.TrailingZeroCount(matches) >> 2;
}
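
For completeness, the same 4-bits-per-element mask also covers the LeadingZeroCount case mentioned above (finding the last match). A minimal sketch, not from the issue, with a hypothetical LastMatch helper:

static int LastMatch(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(src, val);
    ulong matches = AdvSimd.ShiftRightLogicalNarrowingLower(eq.AsUInt16(), 4).AsUInt64().ToScalar();
    // The highest set bit sits at 4 * index + 3, so (63 - lzcnt) / 4 recovers the element index;
    // an all-zero mask yields -1, i.e. "no match".
    return (63 - BitOperations.LeadingZeroCount(matches)) >> 2;
}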

Performance impact

We expect a nice improvement for small inputs, such as HTTP parsing, where we have to find the positions of symbols like ':' and '\n' in relatively short buffers. For large inputs, most of our algorithms first use a fast == Vector128<T>.Zero check to skip chunks without matches, as sketched below.
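
For illustration, a minimal sketch of that large-input pattern (IndexOfByte is a hypothetical helper, not runtime code; it assumes System.Numerics, System.Runtime.InteropServices and System.Runtime.Intrinsics are in scope):

static int IndexOfByte(ReadOnlySpan<byte> span, byte value)
{
    Vector128<byte> target = Vector128.Create(value);
    ref byte start = ref MemoryMarshal.GetReference(span);

    int i = 0;
    for (; i + Vector128<byte>.Count <= span.Length; i += Vector128<byte>.Count)
    {
        Vector128<byte> eq = Vector128.Equals(Vector128.LoadUnsafe(ref start, (nuint)i), target);
        if (eq == Vector128<byte>.Zero)
        {
            continue; // no matches anywhere in this 16-byte chunk, skip it cheaply
        }
        // Only pay for ExtractMostSignificantBits + TrailingZeroCount when there is a match.
        return i + BitOperations.TrailingZeroCount(eq.ExtractMostSignificantBits());
    }

    // Scalar tail for the remaining bytes.
    for (; i < span.Length; i++)
    {
        if (span[i] == value)
        {
            return i;
        }
    }

    return -1;
}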

category:cq
theme:vector-codegen
skill-level:intermediate
cost:small
impact:small

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 23, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 23, 2022
@ghost

ghost commented Sep 23, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@EgorBo EgorBo added this to the Future milestone Sep 23, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Sep 23, 2022
@EgorBo EgorBo self-assigned this Sep 23, 2022
@EgorBo
Member Author

EgorBo commented Sep 23, 2022

Benchmark:

    private static readonly byte[] httpHeader = Encoding.UTF8.GetBytes( 
""" 
Host: 127.0.0.1:5001 
Connection: keep-alive 
Cache-Control: max-age=0 
sec-ch-ua-mobile: ?0 
sec-ch-ua-platform: "Windows" 
Upgrade-Insecure-Requests: 1 
Sec-Fetch-Site: none 
Sec-Fetch-Mode: navigate 
Sec-Fetch-User: ?1 
Sec-Fetch-Dest: document 
Accept-Encoding: gzip, deflate, br 
Accept-Language: en-US,en;q=0.9 
"""); 
 
    [Benchmark] 
    public int CountHeaders() 
    { 
        ReadOnlySpan<byte> span = httpHeader.AsSpan(); 
        int newline = 0; 
        int count = 0; 
        while (newline != -1 && span.Length > newline) 
        { 
            span = span.Slice(newline + 1); 
            newline = span.IndexOfAny((byte)'\n', (byte)':'); // or just IndexOf((byte)'\n')
            count++; 
        } 
        return count; 
    } 

| Method       | Job        | Toolchain               | Mean     | Error   | StdDev  | Ratio |
|--------------|------------|-------------------------|----------|---------|---------|-------|
| CountHeaders | Job-AWWEJK | /Core_Root/corerun      | 193.7 ns | 2.58 ns | 2.41 ns |  0.80 |
| CountHeaders | Job-DZXKTO | /Core_Root_base/corerun | 242.2 ns | 1.11 ns | 1.04 ns |  1.00 |

@EgorBo EgorBo modified the milestones: Future, 8.0.0 Oct 24, 2022
@EgorBo
Member Author

EgorBo commented Mar 29, 2023

So the task here is to optimize TrailingZeroCount(comparison.ExtractMostSignificantBits()) in:

static int FirstMatch_old(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(src, val);
    return BitOperations.TrailingZeroCount(eq.ExtractMostSignificantBits());
}

to emit the same codegen as this function does:

static int FirstMatch_new(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(src, val);
    ulong matches = AdvSimd.ShiftRightLogicalNarrowingLower(eq.AsUInt16(), 4).AsUInt64().ToScalar();
    return BitOperations.TrailingZeroCount(matches) >> 2;
}

The very first task here is to move the ExtractMostSignificantBits expansion from the importer to Lowering. Currently, ExtractMostSignificantBits on ARM64 is imported as the following IR:

STMT00000 ( 0x000[E-] ... ??? )
               [000006] -A---------                         *  ASG       simd16 (copy)
               [000005] D------N---                         +--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V03 tmp1         
               [000004] -----------                         \--*  HWINTRINSIC simd16 ubyte ShiftLogical
               [000003] -----------                            +--*  HWINTRINSIC simd16 ubyte And
               [000000] -----------                            |  +--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V00 arg0         
               [000001] -----------                            |  \--*  CNS_VEC   simd16<0x80808080, 0x80808080, 0x80808080, 0x80808080>
               [000002] -----------                            \--*  CNS_VEC   simd16<0xfcfbfaf9, 0x00fffefd, 0xfcfbfaf9, 0x00fffefd>

    [ 1]   6 (0x006) ret

STMT00001 ( ??? ... ??? )
               [000023] -----------                         *  RETURN    int   
               [000022] -----------                         \--*  OR        int   
               [000012] ---------U-                            +--*  CAST      int <- uint
               [000011] -----------                            |  \--*  HWINTRINSIC ubyte  ubyte ToScalar
               [000010] -----------                            |     \--*  HWINTRINSIC simd8  ubyte AddAcross
               [000009] -----------                            |        \--*  HWINTRINSIC simd8  ubyte GetLower
               [000008] -----------                            |           \--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V03 tmp1         
               [000021] -----------                            \--*  LSH       int   
               [000019] ---------U-                               +--*  CAST      int <- uint
               [000018] -----------                               |  \--*  HWINTRINSIC ubyte  ubyte ToScalar
               [000017] -----------                               |     \--*  HWINTRINSIC simd8  ubyte AddAcross
               [000016] -----------                               |        \--*  HWINTRINSIC simd8  ubyte GetLower
               [000015] -----------                               |           \--*  HWINTRINSIC simd16 ubyte ExtractVector128
               [000007] -----------                               |              +--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V03 tmp1         
               [000013] -----------                               |              +--*  CNS_VEC   simd16<0x00000000, 0x00000000, 0x00000000, 0x00000000>
               [000014] -----------                               |              \--*  CNS_INT   int    8
               [000020] -----------                               \--*  CNS_INT   int    8

Instead, it should be imported as a single HWINTRINSIC simd16 int ExtractMostSignificantBits node and only expanded to the IR above in Lowering. That will allow us to perform optimizations in VN on top of a single node instead of this huge tree.
Here is where it's currently expanded:

case NI_Vector128_ExtractMostSignificantBits:
{
    assert(sig->numArgs == 1);

    // ARM64 doesn't have a single instruction that performs the behavior so we'll emulate it instead.
    // To do this, we effectively perform the following steps:
    // 1. tmp = input & 0x80         ; and the input to clear all but the most significant bit
    // 2. tmp = tmp >> index         ; right shift each element by its index
    // 3. tmp = sum(tmp)             ; sum the elements together
    // For byte/sbyte, we also need to handle the fact that we can only shift by up to 8
    // but for Vector128, we have 16 elements to handle. In that scenario, we will simply
    // extract both scalars, and combine them via: (upper << 8) | lower

    var_types simdType = getSIMDTypeForSize(simdSize);

    op1 = impSIMDPopStack(simdType);

    GenTreeVecCon* vecCon2 = gtNewVconNode(simdType);
    GenTreeVecCon* vecCon3 = gtNewVconNode(simdType);

    switch (simdBaseType)
    {
        case TYP_BYTE:
        case TYP_UBYTE:
        {
            simdBaseType    = TYP_UBYTE;
            simdBaseJitType = CORINFO_TYPE_UBYTE;

            vecCon2->gtSimdVal.u64[0] = 0x8080808080808080;
            vecCon3->gtSimdVal.u64[0] = 0x00FFFEFDFCFBFAF9;

            if (simdSize == 16)
            {
                vecCon2->gtSimdVal.u64[1] = 0x8080808080808080;
                vecCon3->gtSimdVal.u64[1] = 0x00FFFEFDFCFBFAF9;
            }
            break;
        }

        case TYP_SHORT:
        case TYP_USHORT:
        {
            simdBaseType    = TYP_USHORT;
            simdBaseJitType = CORINFO_TYPE_USHORT;

            vecCon2->gtSimdVal.u64[0] = 0x8000800080008000;
            vecCon3->gtSimdVal.u64[0] = 0xFFF4FFF3FFF2FFF1;

            if (simdSize == 16)
            {
                vecCon2->gtSimdVal.u64[1] = 0x8000800080008000;
                vecCon3->gtSimdVal.u64[1] = 0xFFF8FFF7FFF6FFF5;
            }
            break;
        }

        case TYP_INT:
        case TYP_UINT:
        case TYP_FLOAT:
        {
            simdBaseType    = TYP_INT;
            simdBaseJitType = CORINFO_TYPE_INT;

            vecCon2->gtSimdVal.u64[0] = 0x8000000080000000;
            vecCon3->gtSimdVal.u64[0] = 0xFFFFFFE2FFFFFFE1;

            if (simdSize == 16)
            {
                vecCon2->gtSimdVal.u64[1] = 0x8000000080000000;
                vecCon3->gtSimdVal.u64[1] = 0xFFFFFFE4FFFFFFE3;
            }
            break;
        }

        case TYP_LONG:
        case TYP_ULONG:
        case TYP_DOUBLE:
        {
            simdBaseType    = TYP_LONG;
            simdBaseJitType = CORINFO_TYPE_LONG;

            vecCon2->gtSimdVal.u64[0] = 0x8000000000000000;
            vecCon3->gtSimdVal.u64[0] = 0xFFFFFFFFFFFFFFC1;

            if (simdSize == 16)
            {
                vecCon2->gtSimdVal.u64[1] = 0x8000000000000000;
                vecCon3->gtSimdVal.u64[1] = 0xFFFFFFFFFFFFFFC2;
            }
            break;
        }

        default:
        {
            unreached();
        }
    }

    op3 = vecCon3;
    op2 = vecCon2;

    op1 = gtNewSimdHWIntrinsicNode(simdType, op1, op2, NI_AdvSimd_And, simdBaseJitType, simdSize,
                                   /* isSimdAsHWIntrinsic */ false);

    NamedIntrinsic shiftIntrinsic = NI_AdvSimd_ShiftLogical;

    if ((simdSize == 8) && varTypeIsLong(simdBaseType))
    {
        shiftIntrinsic = NI_AdvSimd_ShiftLogicalScalar;
    }

    op1 = gtNewSimdHWIntrinsicNode(simdType, op1, op3, shiftIntrinsic, simdBaseJitType, simdSize,
                                   /* isSimdAsHWIntrinsic */ false);

    if (varTypeIsByte(simdBaseType) && (simdSize == 16))
    {
        CORINFO_CLASS_HANDLE simdClsHnd = gtGetStructHandleForSimdOrHW(TYP_SIMD16, simdBaseJitType);

        op1 = impCloneExpr(op1, &op2, simdClsHnd, CHECK_SPILL_ALL,
                           nullptr DEBUGARG("Clone op1 for vector extractmostsignificantbits"));
        op1 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op1, NI_Vector128_GetLower, simdBaseJitType, simdSize,
                                       /* isSimdAsHWIntrinsic */ false);
        op1 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op1, NI_AdvSimd_Arm64_AddAcross, simdBaseJitType, 8,
                                       /* isSimdAsHWIntrinsic */ false);
        op1 = gtNewSimdHWIntrinsicNode(simdBaseType, op1, NI_Vector64_ToScalar, simdBaseJitType, 8,
                                       /* isSimdAsHWIntrinsic */ false);
        op1 = gtNewCastNode(TYP_INT, op1, /* isUnsigned */ true, TYP_INT);

        GenTree* zero  = gtNewZeroConNode(TYP_SIMD16);
        ssize_t  index = 8 / genTypeSize(simdBaseType);

        op2 = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op2, zero, gtNewIconNode(index), NI_AdvSimd_ExtractVector128,
                                       simdBaseJitType, simdSize, /* isSimdAsHWIntrinsic */ false);
        op2 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op2, NI_Vector128_GetLower, simdBaseJitType, simdSize,
                                       /* isSimdAsHWIntrinsic */ false);
        op2 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op2, NI_AdvSimd_Arm64_AddAcross, simdBaseJitType, 8,
                                       /* isSimdAsHWIntrinsic */ false);
        op2 = gtNewSimdHWIntrinsicNode(simdBaseType, op2, NI_Vector64_ToScalar, simdBaseJitType, 8,
                                       /* isSimdAsHWIntrinsic */ false);
        op2 = gtNewCastNode(TYP_INT, op2, /* isUnsigned */ true, TYP_INT);
        op2 = gtNewOperNode(GT_LSH, TYP_INT, op2, gtNewIconNode(8));

        retNode = gtNewOperNode(GT_OR, TYP_INT, op1, op2);
    }
    else
    {
        if (!varTypeIsLong(simdBaseType))
        {
            if ((simdSize == 8) && ((simdBaseType == TYP_INT) || (simdBaseType == TYP_UINT)))
            {
                CORINFO_CLASS_HANDLE simdClsHnd = gtGetStructHandleForSimdOrHW(simdType, simdBaseJitType);

                op1 = impCloneExpr(op1, &op2, simdClsHnd, CHECK_SPILL_ALL,
                                   nullptr DEBUGARG("Clone op1 for vector extractmostsignificantbits"));
                op1 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op1, op2, NI_AdvSimd_AddPairwise, simdBaseJitType,
                                               simdSize, /* isSimdAsHWIntrinsic */ false);
            }
            else
            {
                op1 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op1, NI_AdvSimd_Arm64_AddAcross, simdBaseJitType,
                                               simdSize, /* isSimdAsHWIntrinsic */ false);
            }
        }
        else if (simdSize == 16)
        {
            op1 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op1, NI_AdvSimd_Arm64_AddPairwiseScalar, simdBaseJitType,
                                           simdSize, /* isSimdAsHWIntrinsic */ false);
        }

        retNode = gtNewSimdHWIntrinsicNode(simdBaseType, op1, NI_Vector64_ToScalar, simdBaseJitType, 8,
                                           /* isSimdAsHWIntrinsic */ false);

        if ((simdBaseType != TYP_INT) && (simdBaseType != TYP_UINT))
        {
            retNode = gtNewCastNode(TYP_INT, retNode, /* isUnsigned */ true, TYP_INT);
        }
    }
    break;
}
This expansion should be moved to Lowering. The next step is then simply to recognize the final shape we want to optimize in VN.

cc @JulieLeeMSFT
