Optimize WithUpper/WithLower with InsertSelectedScalar, SpanHelpers.Sequence APIs #38075
Conversation
Thanks @kunalspathak. I am however surprised that you don't see a performance gain with the early return.
Thanks for taking a look @TamarChristinaArm
Yes.
I think you meant
Currently we don't combine two ldr instructions from adjacent memory into an ldp. I have opened #35132 and #35130 to track it. By the way, just in case you didn't see, we have an epic issue #35853 to track all the ARM64 optimization opportunities. You might want to check the issues in the peephole-optimization category, in case you have any inputs.
For an original length of 21 bytes, after completing the first 16-byte comparison, it does another 16-byte load starting from the end of the span, so the final (overlapping) vector comparison covers the 5-byte remainder without a scalar tail loop (a sketch of this pattern appears after the benchmark below). See the assembly output below; the code size with my changes also increased. Here are the perf numbers:
Benchmark code:

```cs
// Assumed declarations (not shown in the original snippet):
private const int Iters = 1000; // placeholder; the original iteration count was not shown
private static byte[] s_source;

[Params(-1, 0, 496, 497, 498, 499, 500, 501, 502, 503,
        504, 505, 506, 507, 508, 509, 510, 511)]
public int MisMatchIndex { get; set; }

[Benchmark(OperationsPerInvoke = Iters)]
public int SequenceCompareTo()
{
    byte[] input1 = s_source;
    byte[] input2 = input1.ToArray();
    if (MisMatchIndex != -1)
    {
        input2[MisMatchIndex] = 5; // something other than MisMatchIndexValue.
    }
    Span<byte> span1 = new Span<byte>(input1);
    ReadOnlySpan<byte> span2 = new ReadOnlySpan<byte>(input2);
    int total = 0;
    for (int i = 0; i < Iters; i++)
    {
        total += span1.SequenceCompareTo(span2);
    }
    return total;
}

[GlobalSetup]
public void Setup()
{
    var input = Enumerable.Range(0, 512).Select(b => (byte)b);
    s_source = input.Concat(input).ToArray();
}
```

(Note: -1 means both spans had the same contents.)
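
For reference, a minimal sketch of the overlapping final load described above, written against the .NET 7+ `Vector128` helpers (an assumption; the actual `SpanHelpers` code differs). For a 21-byte input, the loop compares bytes [0..16) and the last load re-reads bytes [5..21), so the tail costs one extra vector compare instead of a scalar loop:

```cs
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static bool Equals16ByteBlocks(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
{
    // Fallback for inputs shorter than one full 16-byte vector.
    if (a.Length != b.Length || a.Length < Vector128<byte>.Count)
        return a.SequenceEqual(b);

    ref byte ra = ref MemoryMarshal.GetReference(a);
    ref byte rb = ref MemoryMarshal.GetReference(b);
    nuint last = (nuint)(a.Length - Vector128<byte>.Count); // 21 - 16 = 5

    // Compare full 16-byte blocks.
    for (nuint i = 0; i < last; i += (nuint)Vector128<byte>.Count)
        if (Vector128.LoadUnsafe(ref ra, i) != Vector128.LoadUnsafe(ref rb, i))
            return false;

    // Final, possibly overlapping, 16-byte comparison starting at length - 16.
    return Vector128.LoadUnsafe(ref ra, last) == Vector128.LoadUnsafe(ref rb, last);
}
```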
Ah thanks! I've subscribed to it.
Yes, though your Q-form loads are usually the slowest ones, and older cores can't dual-issue them. Using the smallest possible loads for the unaligned access would be optimal, but probably not that critical.
Thanks! That looks fine. I think the only gains you can get here are with load pairs. What I mean is: right now, in order to process 32 bytes of data you issue two separate 16-byte ldr loads per operand, but with LDP you could fetch both vectors in a single load-pair instruction, where your first two ANDs are dual-issued. I expect this to be faster, but for this to work you'd need something like LoadPairVectors128 (see the sketch after this comment). I suspect you also have some long-term gains to get here (on the long term) from addressing modes and codegen: e.g. if you initialize the base address once up front, your loads then become loads with a constant offset and writeback, so you remove the per-iteration address addition. These should speed things up.
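
To make the suggestion concrete, here is a hedged sketch of processing 32 bytes per step with one load pair per operand. It assumes the tuple-returning `AdvSimd.Arm64.LoadPairVector128` API (its exact name and shape are an assumption here, per the "something like LoadPairVectors128" above), and it is an illustration rather than code from this PR:

```cs
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static unsafe bool Equals32Bytes(byte* a, byte* b)
{
    // One ldp per operand loads two q registers in a single instruction...
    (Vector128<byte> a0, Vector128<byte> a1) = AdvSimd.Arm64.LoadPairVector128(a);
    (Vector128<byte> b0, Vector128<byte> b1) = AdvSimd.Arm64.LoadPairVector128(b);

    // ...and the two compares (and the ANDs that follow) can dual-issue.
    Vector128<byte> eq0 = AdvSimd.CompareEqual(a0, b0);
    Vector128<byte> eq1 = AdvSimd.CompareEqual(a1, b1);

    // CompareEqual sets matching lanes to 0xFF, so the minimum across all
    // lanes is 0xFF only when all 32 bytes matched.
    return AdvSimd.Arm64.MinAcross(AdvSimd.And(eq0, eq1)).ToScalar() == 0xFF;
}
```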
Yeah, that looks as good as it can get for two 128-bit vectors; on older cores you may see some difference.
As mentioned in #37139, use AdvSimd.Arm64.InsertSelectedScalar() to better optimize WithUpper() and WithLower(), and hence Vector128.Create(Vector64, Vector64).

Also, added a note in SpanHelpers.SequenceEqual() and SpanHelpers.SequenceCompareTo() that we are not currently optimizing them with ARM64 intrinsics, because doing so gives similar (sometimes smaller) wins than the existing vectorized version.
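
For illustration, a small usage sketch of the public APIs this change targets; per the description above, composing a Vector128 from two Vector64 halves on ARM64 should now lower to `ins` (InsertSelectedScalar) rather than going through memory:

```cs
using System.Runtime.Intrinsics;

Vector64<float> lower = Vector64.Create(1f, 2f);
Vector64<float> upper = Vector64.Create(3f, 4f);

// Each of these should now map to InsertSelectedScalar (ins) on ARM64.
Vector128<float> combined = Vector128.Create(lower, upper);  // lower half, then upper half
Vector128<float> newUpper = combined.WithUpper(lower);       // replace the upper 64 bits
Vector128<float> newLower = combined.WithLower(upper);       // replace the lower 64 bits
```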