Improve Vector128.ExtractMostSignificantBits for arm64 #76047

Open
Tracked by #94464
EgorBo opened this issue Sep 23, 2022 · 3 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@EgorBo
Member

EgorBo commented Sep 23, 2022

Per discussion with @tannergooding on Discord

Vector128.ExtractMostSignificantBits is quite an important API that is typically used together with comparisons and TrailingZeroCount/LeadingZeroCount to find the position of an element in a vector, e.g. in various IndexOf-like algorithms. Example:

static void PrintPosition()
{
    Vector128<byte> src = Vector128.Create((byte)0, 0, 0, 42, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    Vector128<byte> val = Vector128.Create((byte)42);

    // prints 3 because 42 is at index 3 in the src vector
    Console.WriteLine(FirstMatch(src, val)); 
}

static int FirstMatch(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(src, val);
    return BitOperations.TrailingZeroCount(eq.ExtractMostSignificantBits());
}

Codegen for FirstMatch on x64:

       vpcmpeqb  xmm0, xmm0, xmm1
       vpmovmskb eax, xmm0
       tzcnt     eax, eax

Codegen for FirstMatch on arm64:

            cmeq    v16.16b, v0.16b, v1.16b
            ldr     q17, [@RWD00]
            and     v16.16b, v16.16b, v17.16b
            ldr     q17, [@RWD16]
            ushl    v16.16b, v16.16b, v17.16b
            movi    v17.4s, #0
            ext     v17.16b, v16.16b, v17.16b, #8
            addv    b17, v17.8b
            umov    w0, v17.b[0]
            lsl     w0, w0, #8
            addv    b16, v16.8b
            umov    w1, v16.b[0]
            orr     w0, w0, w1
            rbit    w0, w0
            clz     w0, w0

This is because arm64 doesn't have a direct equivalent of movmsk. However, this particular case can be optimized: we know the input of ExtractMostSignificantBits is a comparison result where every element is either all-zeros or all-bits-set, so in the best case we can use the smart shrn trick from the Arm blog post by @danlark1 (shifting each 16-bit lane right by 4 and narrowing collapses the comparison result into a 64-bit mask with 4 bits per element):

            cmeq    v16.16b, v0.16b, v1.16b
            shrn    v16.8b, v16.8h, #4
            umov    x0, v16.d[0]
            rbit    x0, x0
            clz     x0, x0
            asr     w0, w0, #2

Its C# equivalent:

static int FirstMatch(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(src, val);
    // Each byte of 'eq' contributes 4 bits to the 64-bit mask, in element order.
    ulong matches = AdvSimd.ShiftRightLogicalNarrowingLower(eq.AsUInt16(), 4).AsUInt64().ToScalar();
    // TrailingZeroCount therefore returns 4 * (index of the first match), hence the >> 2.
    return BitOperations.TrailingZeroCount(matches) >> 2;
}
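
For completeness, the same 4-bits-per-element mask also covers the LeadingZeroCount case mentioned above (finding the last match). A minimal sketch, not from the issue, with a hypothetical LastMatch helper:

static int LastMatch(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(src, val);
    ulong matches = AdvSimd.ShiftRightLogicalNarrowingLower(eq.AsUInt16(), 4).AsUInt64().ToScalar();
    // The highest set bit sits at 4 * index + 3, so (63 - lzcnt) / 4 recovers the element index;
    // an all-zero mask yields -1, i.e. "no match".
    return (63 - BitOperations.LeadingZeroCount(matches)) >> 2;
}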

Performance impact

We expect a nice improvement for small inputs, such as HTTP parsing, where we have to find the positions of symbols like ':' and '\n' in relatively short buffers. For large inputs, most of our algorithms first use a fast == Vector128<T>.Zero check to skip chunks without matches, as sketched below.
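
For illustration, a minimal sketch of that large-input pattern (IndexOfByte is a hypothetical helper, not runtime code; it assumes System.Numerics, System.Runtime.InteropServices and System.Runtime.Intrinsics are in scope):

static int IndexOfByte(ReadOnlySpan<byte> span, byte value)
{
    Vector128<byte> target = Vector128.Create(value);
    ref byte start = ref MemoryMarshal.GetReference(span);

    int i = 0;
    for (; i + Vector128<byte>.Count <= span.Length; i += Vector128<byte>.Count)
    {
        Vector128<byte> eq = Vector128.Equals(Vector128.LoadUnsafe(ref start, (nuint)i), target);
        if (eq == Vector128<byte>.Zero)
        {
            continue; // no matches anywhere in this 16-byte chunk, skip it cheaply
        }
        // Only pay for ExtractMostSignificantBits + TrailingZeroCount when there is a match.
        return i + BitOperations.TrailingZeroCount(eq.ExtractMostSignificantBits());
    }

    // Scalar tail for the remaining bytes.
    for (; i < span.Length; i++)
    {
        if (span[i] == value)
        {
            return i;
        }
    }

    return -1;
}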

category:cq
theme:vector-codegen
skill-level:intermediate
cost:small
impact:small

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 23, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 23, 2022
@ghost

ghost commented Sep 23, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@EgorBo EgorBo added this to the Future milestone Sep 23, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Sep 23, 2022
@EgorBo EgorBo self-assigned this Sep 23, 2022
@EgorBo
Member Author

EgorBo commented Sep 23, 2022

Benchmark:

    private static readonly byte[] httpHeader = Encoding.UTF8.GetBytes( 
""" 
Host: 127.0.0.1:5001 
Connection: keep-alive 
Cache-Control: max-age=0 
sec-ch-ua-mobile: ?0 
sec-ch-ua-platform: "Windows" 
Upgrade-Insecure-Requests: 1 
Sec-Fetch-Site: none 
Sec-Fetch-Mode: navigate 
Sec-Fetch-User: ?1 
Sec-Fetch-Dest: document 
Accept-Encoding: gzip, deflate, br 
Accept-Language: en-US,en;q=0.9 
"""); 
 
    [Benchmark] 
    public int CountHeaders() 
    { 
        ReadOnlySpan<byte> span = httpHeader.AsSpan(); 
        int newline = 0; 
        int count = 0; 
        while (newline != -1 && span.Length > newline) 
        { 
            span = span.Slice(newline + 1); 
            newline = span.IndexOfAny((byte)'\n', (byte)':'); // or just IndexOf((byte)'\n')
            count++; 
        } 
        return count; 
    } 

| Method       | Job        | Toolchain               | Mean     | Error   | StdDev  | Ratio |
|--------------|------------|-------------------------|----------|---------|---------|-------|
| CountHeaders | Job-AWWEJK | /Core_Root/corerun      | 193.7 ns | 2.58 ns | 2.41 ns |  0.80 |
| CountHeaders | Job-DZXKTO | /Core_Root_base/corerun | 242.2 ns | 1.11 ns | 1.04 ns |  1.00 |

@EgorBo EgorBo modified the milestones: Future, 8.0.0 Oct 24, 2022
@EgorBo
Member Author

EgorBo commented Mar 29, 2023

So the task here is to optimize TrailingZeroCount(comparison.ExtractMostSignificantBits()) in:

static int FirstMatch_old(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(src, val);
    return BitOperations.TrailingZeroCount(eq.ExtractMostSignificantBits());
}

to emit the same codegen as this function does:

static int FirstMatch_new(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(src, val);
    ulong matches = AdvSimd.ShiftRightLogicalNarrowingLower(eq.AsUInt16(), 4).AsUInt64().ToScalar();
    return BitOperations.TrailingZeroCount(matches) >> 2;
}

The very first task here is to move the ExtractMostSignificantBits expansion from the importer to Lowering. Currently, ExtractMostSignificantBits on ARM64 is imported as the following IR:

STMT00000 ( 0x000[E-] ... ??? )
               [000006] -A---------                         *  ASG       simd16 (copy)
               [000005] D------N---                         +--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V03 tmp1         
               [000004] -----------                         \--*  HWINTRINSIC simd16 ubyte ShiftLogical
               [000003] -----------                            +--*  HWINTRINSIC simd16 ubyte And
               [000000] -----------                            |  +--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V00 arg0         
               [000001] -----------                            |  \--*  CNS_VEC   simd16<0x80808080, 0x80808080, 0x80808080, 0x80808080>
               [000002] -----------                            \--*  CNS_VEC   simd16<0xfcfbfaf9, 0x00fffefd, 0xfcfbfaf9, 0x00fffefd>

    [ 1]   6 (0x006) ret

STMT00001 ( ??? ... ??? )
               [000023] -----------                         *  RETURN    int   
               [000022] -----------                         \--*  OR        int   
               [000012] ---------U-                            +--*  CAST      int <- uint
               [000011] -----------                            |  \--*  HWINTRINSIC ubyte  ubyte ToScalar
               [000010] -----------                            |     \--*  HWINTRINSIC simd8  ubyte AddAcross
               [000009] -----------                            |        \--*  HWINTRINSIC simd8  ubyte GetLower
               [000008] -----------                            |           \--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V03 tmp1         
               [000021] -----------                            \--*  LSH       int   
               [000019] ---------U-                               +--*  CAST      int <- uint
               [000018] -----------                               |  \--*  HWINTRINSIC ubyte  ubyte ToScalar
               [000017] -----------                               |     \--*  HWINTRINSIC simd8  ubyte AddAcross
               [000016] -----------                               |        \--*  HWINTRINSIC simd8  ubyte GetLower
               [000015] -----------                               |           \--*  HWINTRINSIC simd16 ubyte ExtractVector128
               [000007] -----------                               |              +--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V03 tmp1         
               [000013] -----------                               |              +--*  CNS_VEC   simd16<0x00000000, 0x00000000, 0x00000000, 0x00000000>
               [000014] -----------                               |              \--*  CNS_INT   int    8
               [000020] -----------                               \--*  CNS_INT   int    8

Instead, it should be imported as a single HWINTRINSIC simd16 int ExtractMostSignificantBits node and only expanded to the IR above in Lowering. That will allow us to perform optimizations in VN on top of a single node instead of this huge tree.
Here is where it's currently expanded:

case NI_Vector128_ExtractMostSignificantBits:
{
    assert(sig->numArgs == 1);

    // ARM64 doesn't have a single instruction that performs the behavior so we'll emulate it instead.
    // To do this, we effectively perform the following steps:
    // 1. tmp = input & 0x80         ; and the input to clear all but the most significant bit
    // 2. tmp = tmp >> index         ; right shift each element by its index
    // 3. tmp = sum(tmp)             ; sum the elements together
    // For byte/sbyte, we also need to handle the fact that we can only shift by up to 8
    // but for Vector128, we have 16 elements to handle. In that scenario, we will simply
    // extract both scalars, and combine them via: (upper << 8) | lower

    var_types simdType = getSIMDTypeForSize(simdSize);

    op1 = impSIMDPopStack(simdType);

    GenTreeVecCon* vecCon2 = gtNewVconNode(simdType);
    GenTreeVecCon* vecCon3 = gtNewVconNode(simdType);

    switch (simdBaseType)
    {
        case TYP_BYTE:
        case TYP_UBYTE:
        {
            simdBaseType    = TYP_UBYTE;
            simdBaseJitType = CORINFO_TYPE_UBYTE;

            vecCon2->gtSimdVal.u64[0] = 0x8080808080808080;
            vecCon3->gtSimdVal.u64[0] = 0x00FFFEFDFCFBFAF9;

            if (simdSize == 16)
            {
                vecCon2->gtSimdVal.u64[1] = 0x8080808080808080;
                vecCon3->gtSimdVal.u64[1] = 0x00FFFEFDFCFBFAF9;
            }
            break;
        }

        case TYP_SHORT:
        case TYP_USHORT:
        {
            simdBaseType    = TYP_USHORT;
            simdBaseJitType = CORINFO_TYPE_USHORT;

            vecCon2->gtSimdVal.u64[0] = 0x8000800080008000;
            vecCon3->gtSimdVal.u64[0] = 0xFFF4FFF3FFF2FFF1;

            if (simdSize == 16)
            {
                vecCon2->gtSimdVal.u64[1] = 0x8000800080008000;
                vecCon3->gtSimdVal.u64[1] = 0xFFF8FFF7FFF6FFF5;
            }
            break;
        }

        case TYP_INT:
        case TYP_UINT:
        case TYP_FLOAT:
        {
            simdBaseType    = TYP_INT;
            simdBaseJitType = CORINFO_TYPE_INT;

            vecCon2->gtSimdVal.u64[0] = 0x8000000080000000;
            vecCon3->gtSimdVal.u64[0] = 0xFFFFFFE2FFFFFFE1;

            if (simdSize == 16)
            {
                vecCon2->gtSimdVal.u64[1] = 0x8000000080000000;
                vecCon3->gtSimdVal.u64[1] = 0xFFFFFFE4FFFFFFE3;
            }
            break;
        }

        case TYP_LONG:
        case TYP_ULONG:
        case TYP_DOUBLE:
        {
            simdBaseType    = TYP_LONG;
            simdBaseJitType = CORINFO_TYPE_LONG;

            vecCon2->gtSimdVal.u64[0] = 0x8000000000000000;
            vecCon3->gtSimdVal.u64[0] = 0xFFFFFFFFFFFFFFC1;

            if (simdSize == 16)
            {
                vecCon2->gtSimdVal.u64[1] = 0x8000000000000000;
                vecCon3->gtSimdVal.u64[1] = 0xFFFFFFFFFFFFFFC2;
            }
            break;
        }

        default:
        {
            unreached();
        }
    }

    op3 = vecCon3;
    op2 = vecCon2;

    op1 = gtNewSimdHWIntrinsicNode(simdType, op1, op2, NI_AdvSimd_And, simdBaseJitType, simdSize,
                                   /* isSimdAsHWIntrinsic */ false);

    NamedIntrinsic shiftIntrinsic = NI_AdvSimd_ShiftLogical;

    if ((simdSize == 8) && varTypeIsLong(simdBaseType))
    {
        shiftIntrinsic = NI_AdvSimd_ShiftLogicalScalar;
    }

    op1 = gtNewSimdHWIntrinsicNode(simdType, op1, op3, shiftIntrinsic, simdBaseJitType, simdSize,
                                   /* isSimdAsHWIntrinsic */ false);

    if (varTypeIsByte(simdBaseType) && (simdSize == 16))
    {
        CORINFO_CLASS_HANDLE simdClsHnd = gtGetStructHandleForSimdOrHW(TYP_SIMD16, simdBaseJitType);

        op1 = impCloneExpr(op1, &op2, simdClsHnd, CHECK_SPILL_ALL,
                           nullptr DEBUGARG("Clone op1 for vector extractmostsignificantbits"));
        op1 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op1, NI_Vector128_GetLower, simdBaseJitType, simdSize,
                                       /* isSimdAsHWIntrinsic */ false);
        op1 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op1, NI_AdvSimd_Arm64_AddAcross, simdBaseJitType, 8,
                                       /* isSimdAsHWIntrinsic */ false);
        op1 = gtNewSimdHWIntrinsicNode(simdBaseType, op1, NI_Vector64_ToScalar, simdBaseJitType, 8,
                                       /* isSimdAsHWIntrinsic */ false);
        op1 = gtNewCastNode(TYP_INT, op1, /* isUnsigned */ true, TYP_INT);

        GenTree* zero  = gtNewZeroConNode(TYP_SIMD16);
        ssize_t  index = 8 / genTypeSize(simdBaseType);

        op2 = gtNewSimdHWIntrinsicNode(TYP_SIMD16, op2, zero, gtNewIconNode(index), NI_AdvSimd_ExtractVector128,
                                       simdBaseJitType, simdSize, /* isSimdAsHWIntrinsic */ false);
        op2 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op2, NI_Vector128_GetLower, simdBaseJitType, simdSize,
                                       /* isSimdAsHWIntrinsic */ false);
        op2 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op2, NI_AdvSimd_Arm64_AddAcross, simdBaseJitType, 8,
                                       /* isSimdAsHWIntrinsic */ false);
        op2 = gtNewSimdHWIntrinsicNode(simdBaseType, op2, NI_Vector64_ToScalar, simdBaseJitType, 8,
                                       /* isSimdAsHWIntrinsic */ false);
        op2 = gtNewCastNode(TYP_INT, op2, /* isUnsigned */ true, TYP_INT);
        op2 = gtNewOperNode(GT_LSH, TYP_INT, op2, gtNewIconNode(8));

        retNode = gtNewOperNode(GT_OR, TYP_INT, op1, op2);
    }
    else
    {
        if (!varTypeIsLong(simdBaseType))
        {
            if ((simdSize == 8) && ((simdBaseType == TYP_INT) || (simdBaseType == TYP_UINT)))
            {
                CORINFO_CLASS_HANDLE simdClsHnd = gtGetStructHandleForSimdOrHW(simdType, simdBaseJitType);

                op1 = impCloneExpr(op1, &op2, simdClsHnd, CHECK_SPILL_ALL,
                                   nullptr DEBUGARG("Clone op1 for vector extractmostsignificantbits"));
                op1 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op1, op2, NI_AdvSimd_AddPairwise, simdBaseJitType,
                                               simdSize, /* isSimdAsHWIntrinsic */ false);
            }
            else
            {
                op1 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op1, NI_AdvSimd_Arm64_AddAcross, simdBaseJitType,
                                               simdSize, /* isSimdAsHWIntrinsic */ false);
            }
        }
        else if (simdSize == 16)
        {
            op1 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op1, NI_AdvSimd_Arm64_AddPairwiseScalar, simdBaseJitType,
                                           simdSize, /* isSimdAsHWIntrinsic */ false);
        }

        retNode = gtNewSimdHWIntrinsicNode(simdBaseType, op1, NI_Vector64_ToScalar, simdBaseJitType, 8,
                                           /* isSimdAsHWIntrinsic */ false);

        if ((simdBaseType != TYP_INT) && (simdBaseType != TYP_UINT))
        {
            retNode = gtNewCastNode(TYP_INT, retNode, /* isUnsigned */ true, TYP_INT);
        }
    }
    break;
}
This expansion should be moved to Lowering. The next step is then simply to recognize the final shape we want to optimize in VN.

cc @JulieLeeMSFT
