Stop consuming result of GetPinnedReference() #1280

Merged · 2 commits into dotnet:master · Apr 15, 2020

Conversation

kunalspathak
Member


We were consuming the values returned from the `GetPinnedReference()` API; however,
`.Consume()` uses `volatile`, and that introduces memory barriers on ARM64.
Since we just want to measure the performance of `GetPinnedReference()`, it is
unnecessary to introduce a differentiating factor for one architecture and not the
other. Hence I have removed the `.Consume()` call and instead just store the returned
result in a variable.
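
In benchmark terms, the change looks roughly like the following sketch (simplified, with illustrative names rather than the exact dotnet/performance source; `char` matches the halfword `ldrh` in the disassembly below):

```csharp
using System;
using BenchmarkDotNet.Engines;

public class GetPinnableReferenceSketch
{
    private readonly char[] _array = new char[512];
    private readonly Consumer _consumer = new Consumer();
    private char _c;

    // Before: Consumer.Consume writes the value to a volatile field, which on
    // ARM64 costs a "dmb ish" barrier on every measured call.
    public void Before()
    {
        Span<char> span = _array;
        _consumer.Consume(span.GetPinnableReference());
    }

    // After: a plain store into a field keeps the result without the barrier,
    // so only the "ldrh" load per call remains.
    public void After()
    {
        Span<char> span = _array;
        _c = span.GetPinnableReference();
    }
}
```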

Before:
```asm
...
G_M52573_IG23:
        79400063          ldrh    w3, [x3]
        D5033BBF          dmb     ish
        79008023          strh    w3, [x1,#64]
        D2800003          mov     x3, #0
        34000040          cbz     w0, G_M52573_IG25
                                                ;; bbWeight=1    PerfScore 15.50
G_M52573_IG24:
        AA0203E3          mov     x3, x2
                                                ;; bbWeight=0.25 PerfScore 0.13
G_M52573_IG25:
        79400063          ldrh    w3, [x3]
        D5033BBF          dmb     ish
        79008023          strh    w3, [x1,#64]
        D2800003          mov     x3, #0
        34000040          cbz     w0, G_M52573_IG27
                                                ;; bbWeight=1    PerfScore 15.50
G_M52573_IG26:
        AA0203E3          mov     x3, x2
                                                ;; bbWeight=0.25 PerfScore 0.13
...

```

After
```asm
...
G_M51552_IG23:
        79400021          ldrh    w1, [x1]
        D2800001          mov     x1, #0
        34000040          cbz     w0, G_M51552_IG25
                                                ;; bbWeight=1    PerfScore 4.50
G_M51552_IG24:
        AA0203E1          mov     x1, x2
                                                ;; bbWeight=0.25 PerfScore 0.13
G_M51552_IG25:
        79400021          ldrh    w1, [x1]
        D2800001          mov     x1, #0
        34000040          cbz     w0, G_M51552_IG27
                                                ;; bbWeight=1    PerfScore 4.50
G_M51552_IG26:
        AA0203E1          mov     x1, x2
                                                ;; bbWeight=0.25 PerfScore 0.13
...

```

This change reduces the ARM64 numbers for this benchmark from 10ns to 1ns. I am not sure if
we should just add a `Consume()` method and mark it as `NoInline`; that's what is done for
the `System.Memory.Span<T>.GetPinnedReference()` benchmark. I have also increased the number of
calls to `GetPinnableReference()` to make sure we get a valid time measurement.
kunalspathak changed the title from "Stop consuming pinned references" to "Stop consuming result of GetPinnedReference()" on Apr 13, 2020
@BruceForstall
Member

@adamsitnik @billwert

adamsitnik self-requested a review on April 14, 2020, 08:02
adamsitnik (Member) left a comment

We try to avoid changing the benchmarks, but in this case, it makes perfect sense: if the time went from 10ns to 1ns it means that so far for ARM we were benchmarking the memory barriers, not GetPinnableReference.

Overall looks good to me, but I've found two things that we need to change before merging.

BTW @kunalspathak, is this the only nanobenchmark that uses Consumer.Consume?

Comment on lines +26 to +41
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
Member

Is there a possibility that in the future the JIT is going to replace all 32 calls with a single one?

@EgorBo or perhaps it can happen as of today with Mono using LLVM?

If so, we should most probably xor the results

Suggested change
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
+ c = span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();

Member

@adamsitnik I've just checked: Mono-LLVM optimizes out all the dead stores (basically, it does span.GetPinnableReference() only once). The "^=" trick doesn't help 🙂 (with the proposed change it optimized the whole function to return 0).

Member

@EgorBo thank you! Do you know how we can trick Mono-LLVM? Mix + with ^?

Member

No idea; GetPinnableReference is just too simple and returns a pointer. I guess only if you hide the span behind a non-inlineable method, e.g. GetSpan().GetPinnableReference() ^ ..., but I guess the overhead of that method will be bigger than the actual GetPinnableReference.
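
Roughly what that workaround might look like (a hypothetical sketch; `GetSpan` and the surrounding type are illustrative, not code from this PR):

```csharp
using System;
using System.Runtime.CompilerServices;

public class HiddenSpanSketch
{
    private readonly char[] _array = new char[512];

    // Non-inlineable accessor: the optimizer cannot see where the span comes from,
    // so it cannot prove that repeated GetPinnableReference() calls are redundant.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private Span<char> GetSpan() => _array;

    public int Measure()
    {
        int c = GetSpan().GetPinnableReference();
        c ^= GetSpan().GetPinnableReference();
        // ...but each GetSpan() call adds overhead that would likely dwarf the
        // few instructions GetPinnableReference() itself costs.
        return c;
    }
}
```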

Member Author

Weird. Surprisingly, RyuJIT doesn't eliminate dead stores (at least for this example). The code that I pasted above is precisely what gets generated today. I will investigate and possibly open an issue to track this.

Member Author

Updated the existing issue with my comments: dotnet/runtime#13727 (comment)

Member Author

Perhaps we can do something like this?

`private static void Consume(in System.Span<T> _) { }`

But yeah, that would have method call overhead too.
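
For reference, the no-op sink might be shaped like this (a hypothetical sketch; the `NoInlining` attribute and names are illustrative, not part of this PR). It keeps the argument observably "used" without the volatile write that `Consumer.Consume` performs, at the price of a real call per invocation:

```csharp
using System;
using System.Runtime.CompilerServices;

internal static class Sink
{
    // Empty, non-inlineable consumer: the JIT must keep the call (and therefore
    // its argument) alive, but no memory barrier is involved.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void Consume<T>(in Span<T> _) { }
}
```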

Member

In that case, I think I am going to merge it as it is.

If it becomes a problem in the future, we might consider removing this benchmark because it looks like it might be impossible to properly benchmark this method (it's too simple).

/cc @billwert @DrewScoggins

adamsitnik merged commit a6955b7 into dotnet:master on Apr 15, 2020
@billwert
Member

I agree with @adamsitnik that the value of this benchmark is likely to be very small. What's the real regression risk for this method? As @EgorBo points out, the method is pretty trivial as is. Is that ever likely to change? (I imagine not.)

I'm fine with the change as is, but if this becomes problematic in the future (or becomes impossible to filter out of our results if it is noisy), we should indeed just get rid of it.
