-
Notifications
You must be signed in to change notification settings - Fork 273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop consuming result of GetPinnedReference() #1280
Conversation
We were consuming the values returned from `GetPinnedReference()` API, however `.Consume()` uses `volatile` and that introduces memory barriers for ARM64. Since we just want to measure the performance of `GetPinnedReference()` it is unnecessary to introduce a differentiating factor for 1 architecture and not the other. Hence I have removed the `.Consume()` and instead just storing the returned result in a variable. Before: ```asm ... G_M52573_IG23: 79400063 ldrh w3, [x3] D5033BBF dmb ish 79008023 strh w3, [x1,dotnet#64] D2800003 mov x3, #0 34000040 cbz w0, G_M52573_IG25 ;; bbWeight=1 PerfScore 15.50 G_M52573_IG24: AA0203E3 mov x3, x2 ;; bbWeight=0.25 PerfScore 0.13 G_M52573_IG25: 79400063 ldrh w3, [x3] D5033BBF dmb ish 79008023 strh w3, [x1,dotnet#64] D2800003 mov x3, #0 34000040 cbz w0, G_M52573_IG27 ;; bbWeight=1 PerfScore 15.50 G_M52573_IG26: AA0203E3 mov x3, x2 ;; bbWeight=0.25 PerfScore 0.13 ... ``` After ```asm ... G_M51552_IG23: 79400021 ldrh w1, [x1] D2800001 mov x1, #0 34000040 cbz w0, G_M51552_IG25 ;; bbWeight=1 PerfScore 4.50 G_M51552_IG24: AA0203E1 mov x1, x2 ;; bbWeight=0.25 PerfScore 0.13 G_M51552_IG25: 79400021 ldrh w1, [x1] D2800001 mov x1, #0 34000040 cbz w0, G_M51552_IG27 ;; bbWeight=1 PerfScore 4.50 G_M51552_IG26: AA0203E1 mov x1, x2 ;; bbWeight=0.25 PerfScore 0.13 ... ``` This change reduces the ARM64 numbers for this benchmark from 10ns to 1ns. I am not sure if we should just add `Consume()` method and mark it as `NoInline`. That's what is done for `System.Memory.Span<T>.GetPinnedReference()` benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We try to avoid changing the benchmarks, but in this case, it makes perfect sense: if the time went from 10ns to 1ns it means that so far for ARM we were benchmarking the memory barriers, not GetPinnableReference
.
Overall looks good to me, but I've found two things that we need to change before merging.
BTW @kunalspathak is this the only nanobenchmark that is using Consumer.Consume
?
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | ||
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | ||
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | ||
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | ||
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | ||
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | ||
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | ||
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | ||
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | ||
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | ||
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | ||
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | ||
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | ||
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | ||
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | ||
c = span.GetPinnableReference(); c = span.GetPinnableReference(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a possibility that in the future JIT is going to replace all 32 calls with a single one?
@EgorBo or perhaps it can happen as of today with Mono using LLVM?
If so, we should most probably xor the results
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | |
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | |
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | |
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | |
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | |
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | |
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | |
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | |
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | |
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | |
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | |
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | |
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | |
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | |
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | |
c = span.GetPinnableReference(); c = span.GetPinnableReference(); | |
c = span.GetPinnableReference(); c ^= span.GetPinnableReference(); | |
c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference(); | |
c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference(); | |
c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference(); | |
c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference(); | |
c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference(); | |
c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference(); | |
c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference(); | |
c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference(); | |
c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference(); | |
c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference(); | |
c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference(); | |
c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference(); | |
c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference(); | |
c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference(); | |
c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adamsitnik I've just checked - Mono-LLVM optimizes out all the dead stores (basicaly, does span.GetPinnableReference()
only once). The "^=" trick doesn't help 🙂 (with the proposed change it optimized the whole function to return 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@EgorBo thank you! Do you know how can we trick the Mono-LLVM ? Mix +
with ^
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No idea, GetPinnableReference
is just too simple and returns a pointer. I guess only if you hide span
to a non-inlineable method e.g. GetSpan().GetPinnableReference() ^ ..
but I guess the overhead of that method will be bigger than the actual GetPinnableReference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird. Surprisingly RYUJit doesn't eliminate dead stores (at least for this example). The code that I pasted above is precisely that gets generated today. I will investigate and possibly open an issue to track this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated existing issue with my comments. dotnet/runtime#13727 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can do something like this?
private static void Consume(in System.Span<T> _) { } |
But yeah, that would have method call overhead too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In such case, I think that I am going to merge it as it is.
If it becomes a problem in the future, we might consider removing this benchmark because it looks like it might be impossible to properly benchmark this method (it's too simple).
I agree with @adamsitnik that the value of this benchmark is likely to be very small. What's the real regression risk for this method? As @EgorBo points out the method is pretty trivial as is. Is that ever likely to change? (I imagine not.) I'm fine with the change as is but if this becomes problematic in the future (or becomes impossible to filter out of our results if it is noisy) we should indeed just get rid of it. |
We were consuming the values returned from
GetPinnedReference()
API, however.Consume()
usesvolatile
and that introduces memory barriers for ARM64.Since we just want to measure the performance of
GetPinnedReference()
it isunnecessary to introduce a differentiating factor for 1 architecture and not the
other. Hence I have removed the
.Consume()
and instead just storing the returnedresult in a variable.
Before:
After
This change reduces the ARM64 numbers for this benchmark from 10ns to 1ns. I am not sure if
we should just add
Consume()
method and mark it asNoInline
. That's what is done forSystem.Memory.Span<T>.GetPinnedReference()
benchmark. I have also increased the number of calls toGetPinnableReference()
to make sure we can a valid time measurement value.