Improve Vector128.ExtractMostSignificantBits for arm64 #76047
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Issue Details
Per discussion with @tannergooding on Discord
static void PrintPosition()
{
    Vector128<byte> src = Vector128.Create((byte)0, 0, 0, 42, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
    Vector128<byte> val = Vector128.Create((byte)42);
    // prints 3 because 42 sits at index 3 in the src vector
    Console.WriteLine(FirstMatch(src, val));
}

static int FirstMatch(Vector128<byte> src, Vector128<byte> val)
{
    // Every lane of eq is either zero (no match) or all-bits-set (match)
    Vector128<byte> eq = Vector128.Equals(src, val);
    return BitOperations.TrailingZeroCount(eq.ExtractMostSignificantBits());
}
Codegen for FirstMatch on x64:
vmovupd xmm0, xmmword ptr [rcx]
vpcmpeqb xmm0, xmm0, xmmword ptr [rdx]
vpmovmskb eax, xmm0
tzcnt eax, eax
Codegen for FirstMatch on arm64:
cmeq v16.16b, v0.16b, v1.16b
ldr q17, [@RWD00]
and v16.16b, v16.16b, v17.16b
ldr q17, [@RWD16]
ushl v16.16b, v16.16b, v17.16b
movi v17.4s, #0
ext v17.16b, v16.16b, v17.16b, #8
addv b17, v17.8b
umov w0, v17.b[0]
lsl w0, w0, #8
addv b16, v16.8b
umov w1, v16.b[0]
orr w0, w0, w1
rbit w0, w0
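clz w0, w0
This is because arm64 doesn't have a direct equivalent of movmsk. However, this particular case can be optimized because we know that the input of ExtractMostSignificantBits is a comparison result whose elements are each either zero or all-bits-set, so in the best case we can perform this smart trick from the Arm blog by @danlark1:
cmeq v16.16b, v0.16b, v1.16b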
shrn v16.8b, v16.8h, #4
umov x0, v16.d[0]
rbit x0, x0
clz x0, x0
asr w0, w0, #2
Its C# equivalent:
static int FirstMatch(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(src, val);
    // Narrow the comparison result (eq, not src): each input byte becomes 4 bits of the 64-bit mask
    ulong matches = AdvSimd.ShiftRightLogicalNarrowingLower(eq.AsUInt16(), 4).AsUInt64().ToScalar();
    // 4 mask bits per byte, so divide the bit position by 4
    return BitOperations.TrailingZeroCount(matches) >> 2;
}
Performance impact
We expect a nice improvement for small inputs, like what HTTP parsing typically sees, where we have to find positions of symbols like :, \n, etc.
Benchmark:
private static readonly byte[] httpHeader = Encoding.UTF8.GetBytes(
    """
    Host: 127.0.0.1:5001
    Connection: keep-alive
    Cache-Control: max-age=0
    sec-ch-ua-mobile: ?0
    sec-ch-ua-platform: "Windows"
    Upgrade-Insecure-Requests: 1
    Sec-Fetch-Site: none
    Sec-Fetch-Mode: navigate
    Sec-Fetch-User: ?1
    Sec-Fetch-Dest: document
    Accept-Encoding: gzip, deflate, br
    Accept-Language: en-US,en;q=0.9
    """);

[Benchmark]
public int CountHeaders()
{
    ReadOnlySpan<byte> span = httpHeader.AsSpan();
    int newline = 0;
    int count = 0;
    while (newline != -1 && span.Length > newline)
    {
        span = span.Slice(newline + 1);
        newline = span.IndexOfAny((byte)'\n', (byte)':'); // or just IndexOf((byte)'\n')
        count++;
    }
    return count;
}
So the task here is to optimize
static int FirstMatch_old(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(src, val);
    return BitOperations.TrailingZeroCount(eq.ExtractMostSignificantBits());
}
to emit the same codegen as this function does:
static int FirstMatch_new(Vector128<byte> src, Vector128<byte> val)
{
    Vector128<byte> eq = Vector128.Equals(src, val);
    // Narrow the comparison result (eq, not src)
    ulong matches = AdvSimd.ShiftRightLogicalNarrowingLower(eq.AsUInt16(), 4).AsUInt64().ToScalar();
    return BitOperations.TrailingZeroCount(matches) >> 2;
}
Very first task here is to move
Instead, it should be just runtime/src/coreclr/jit/hwintrinsicarm64.cpp Lines 943 to 1122 in 91df184
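Per discussion with @tannergooding on Discord
Vector128.ExtractMostSignificantBits is quite an important API that is typically used together with comparisons and TrailingZeroCount/LeadingZeroCount to detect the position of an element in a vector, typically in various IndexOf-like algorithms. Example: the FirstMatch function, which compares src with val and returns the TrailingZeroCount of the comparison result's ExtractMostSignificantBits.
Codegen for FirstMatch on x64: four instructions (vmovupd, vpcmpeqb, vpmovmskb, tzcnt).
Codegen for FirstMatch on arm64: a 15-instruction sequence, because arm64 doesn't have a direct equivalent of movmsk.
However, this particular case can be optimized: we know that the input of ExtractMostSignificantBits is a comparison result whose elements are each either zero or all-bits-set, so in the best case we can perform the smart trick from the Arm blog by @danlark1 (cmeq, shrn by 4, umov, rbit, clz, asr by 2). Its C# equivalent narrows the comparison result with AdvSimd.ShiftRightLogicalNarrowingLower(eq.AsUInt16(), 4), reads the result as a ulong, and shifts the trailing zero count right by 2.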
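To make the trick concrete, here is a minimal sketch of the narrowing step. The class name NibbleMaskDemo and the helper NibbleMask are hypothetical names chosen for illustration, and the scalar fallback exists only so the sketch also runs on non-Arm hardware; it is not what the JIT would emit. It assumes a little-endian target and that every byte of the input is 0x00 or 0xFF.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class NibbleMaskDemo
{
    // Builds a 64-bit mask that holds 4 bits per input byte.
    // Assumption: eq is a comparison result, so each byte is 0x00 or 0xFF.
    public static ulong NibbleMask(Vector128<byte> eq)
    {
        if (AdvSimd.IsSupported)
        {
            // shrn: shift each 16-bit lane right by 4 and keep its low 8 bits.
            return AdvSimd.ShiftRightLogicalNarrowingLower(eq.AsUInt16(), 4).AsUInt64().ToScalar();
        }

        // Scalar emulation of the same operation (illustration only).
        Vector128<ushort> lanes = eq.AsUInt16();
        ulong mask = 0;
        for (int i = 0; i < Vector128<ushort>.Count; i++)
        {
            byte narrowed = (byte)(lanes.GetElement(i) >> 4);
            mask |= (ulong)narrowed << (8 * i);
        }
        return mask;
    }

    // Index of the first byte of src equal to the broadcast value in val,
    // or 16 when nothing matches (the mask is 0 and TrailingZeroCount is 64).
    public static int FirstMatch(Vector128<byte> src, Vector128<byte> val)
    {
        Vector128<byte> eq = Vector128.Equals(src, val);
        // Byte k owns mask bits 4k..4k+3, so divide the bit index by 4.
        return BitOperations.TrailingZeroCount(NibbleMask(eq)) >> 2;
    }
}
```

With a src vector that holds 42 at index 3, FirstMatch(src, Vector128.Create((byte)42)) yields a mask of 0xF000, a trailing zero count of 12, and a result of 3. Note that the no-match result is 16 here, whereas the ExtractMostSignificantBits version returns 32, so callers that rely on the no-match value would need to account for that.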
Performance impact
We expect a nice improvement for small inputs, like what HTTP parsing typically sees, where we have to find positions of symbols like :, \n, etc. in relatively small inputs. For large inputs, most of our algorithms use a fast compare == Vector128<>.Zero check to ignore chunks without matches.
category:cq
theme:vector-codegen
skill-level:intermediate
cost:small
impact:small
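The large-input strategy mentioned under Performance impact can be sketched as follows. This is an illustrative loop, not the actual IndexOf implementation in the runtime; the class name ChunkedSearchDemo is hypothetical, and the sketch assumes .NET 7 or later for Vector128.Create(ReadOnlySpan<T>). It shows why the mask extraction cost matters mostly for small inputs: chunks without a match exit through the cheap zero check and never reach ExtractMostSignificantBits.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

static class ChunkedSearchDemo
{
    // Returns the index of the first occurrence of value in data, or -1.
    public static int IndexOf(ReadOnlySpan<byte> data, byte value)
    {
        Vector128<byte> needle = Vector128.Create(value);
        int i = 0;

        // Process full 16-byte chunks.
        for (; i <= data.Length - Vector128<byte>.Count; i += Vector128<byte>.Count)
        {
            Vector128<byte> chunk = Vector128.Create(data.Slice(i, Vector128<byte>.Count));
            Vector128<byte> eq = Vector128.Equals(chunk, needle);

            // Cheap test: skip the chunk entirely when nothing matched.
            if (eq == Vector128<byte>.Zero)
                continue;

            // Only a chunk that contains a match pays for the mask extraction.
            return i + BitOperations.TrailingZeroCount(eq.ExtractMostSignificantBits());
        }

        // Scalar tail for the remaining bytes.
        for (; i < data.Length; i++)
        {
            if (data[i] == value)
                return i;
        }
        return -1;
    }
}
```

For short inputs such as individual HTTP header lines, nearly every call reaches the extraction step, which is why the shorter arm64 sequence is expected to help most there.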