Stop consuming result of GetPinnedReference() #1280

Merged · 2 commits into dotnet:master · Apr 15, 2020

Conversation

kunalspathak
Member


We were consuming the values returned from the `GetPinnedReference()` API; however,
`.Consume()` uses `volatile`, and that introduces memory barriers on ARM64.
Since we just want to measure the performance of `GetPinnedReference()`, it is
unnecessary to introduce a differentiating factor for one architecture and not the
other. Hence I have removed the `.Consume()` call and instead just store the returned
result in a variable.
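
In benchmark terms, the change looks roughly like the following sketch (simplified, with illustrative names rather than the exact dotnet/performance source; `char` matches the halfword `ldrh` in the disassembly below):

```csharp
using System;
using BenchmarkDotNet.Engines;

public class GetPinnableReferenceSketch
{
    private readonly char[] _array = new char[512];
    private readonly Consumer _consumer = new Consumer();
    private char _c;

    // Before: Consumer.Consume writes the value to a volatile field, which on
    // ARM64 costs a "dmb ish" barrier on every measured call.
    public void Before()
    {
        Span<char> span = _array;
        _consumer.Consume(span.GetPinnableReference());
    }

    // After: a plain store into a field keeps the result without the barrier,
    // so only the "ldrh" load per call remains.
    public void After()
    {
        Span<char> span = _array;
        _c = span.GetPinnableReference();
    }
}
```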

Before:
```asm
...
G_M52573_IG23:
        79400063          ldrh    w3, [x3]
        D5033BBF          dmb     ish
        79008023          strh    w3, [x1,#64]
        D2800003          mov     x3, #0
        34000040          cbz     w0, G_M52573_IG25
                                                ;; bbWeight=1    PerfScore 15.50
G_M52573_IG24:
        AA0203E3          mov     x3, x2
                                                ;; bbWeight=0.25 PerfScore 0.13
G_M52573_IG25:
        79400063          ldrh    w3, [x3]
        D5033BBF          dmb     ish
        79008023          strh    w3, [x1,#64]
        D2800003          mov     x3, #0
        34000040          cbz     w0, G_M52573_IG27
                                                ;; bbWeight=1    PerfScore 15.50
G_M52573_IG26:
        AA0203E3          mov     x3, x2
                                                ;; bbWeight=0.25 PerfScore 0.13
...

```

After
```asm
...
G_M51552_IG23:
        79400021          ldrh    w1, [x1]
        D2800001          mov     x1, #0
        34000040          cbz     w0, G_M51552_IG25
                                                ;; bbWeight=1    PerfScore 4.50
G_M51552_IG24:
        AA0203E1          mov     x1, x2
                                                ;; bbWeight=0.25 PerfScore 0.13
G_M51552_IG25:
        79400021          ldrh    w1, [x1]
        D2800001          mov     x1, #0
        34000040          cbz     w0, G_M51552_IG27
                                                ;; bbWeight=1    PerfScore 4.50
G_M51552_IG26:
        AA0203E1          mov     x1, x2
                                                ;; bbWeight=0.25 PerfScore 0.13
...

```

This change reduces the ARM64 numbers for this benchmark from 10ns to 1ns. I am not sure if
we should just add a `Consume()` method and mark it as `NoInline`; that's what is done for
the `System.Memory.Span<T>.GetPinnedReference()` benchmark. I have also increased the number of
calls to `GetPinnableReference()` to make sure we get a valid time measurement.
kunalspathak changed the title from "Stop consuming pinned references" to "Stop consuming result of GetPinnedReference()" on Apr 13, 2020
@BruceForstall
Member

@adamsitnik @billwert

adamsitnik self-requested a review on April 14, 2020, 08:02
adamsitnik (Member) left a comment

We try to avoid changing the benchmarks, but in this case, it makes perfect sense: if the time went from 10ns to 1ns it means that so far for ARM we were benchmarking the memory barriers, not GetPinnableReference.

Overall looks good to me, but I've found two things that we need to change before merging.

BTW @kunalspathak, is this the only nanobenchmark that uses Consumer.Consume?

Comment on lines +26 to +41
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
c = span.GetPinnableReference(); c = span.GetPinnableReference();
Member

Is there a possibility that in the future the JIT is going to replace all 32 calls with a single one?

@EgorBo or perhaps it can happen as of today with Mono using LLVM?

If so, we should most probably xor the results

Suggested change
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
- c = span.GetPinnableReference(); c = span.GetPinnableReference();
+ c = span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();
+ c ^= span.GetPinnableReference(); c ^= span.GetPinnableReference();

Member

@adamsitnik I've just checked: Mono-LLVM optimizes out all the dead stores (basically, it does span.GetPinnableReference() only once). The "^=" trick doesn't help 🙂 (with the proposed change it optimized the whole function to return 0).

Member

@EgorBo thank you! Do you know how we can trick Mono-LLVM? Mix + with ^?

Member

No idea; GetPinnableReference is just too simple and returns a pointer. I guess only if you hide the span behind a non-inlineable method, e.g. GetSpan().GetPinnableReference() ^ ..., but I guess the overhead of that method will be bigger than the actual GetPinnableReference.
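
Roughly what that workaround might look like (a hypothetical sketch; `GetSpan` and the surrounding type are illustrative, not code from this PR):

```csharp
using System;
using System.Runtime.CompilerServices;

public class HiddenSpanSketch
{
    private readonly char[] _array = new char[512];

    // Non-inlineable accessor: the optimizer cannot see where the span comes from,
    // so it cannot prove that repeated GetPinnableReference() calls are redundant.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private Span<char> GetSpan() => _array;

    public int Measure()
    {
        int c = GetSpan().GetPinnableReference();
        c ^= GetSpan().GetPinnableReference();
        // ...but each GetSpan() call adds overhead that would likely dwarf the
        // few instructions GetPinnableReference() itself costs.
        return c;
    }
}
```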

Member Author

Weird. Surprisingly, RyuJIT doesn't eliminate dead stores (at least for this example). The code that I pasted above is precisely what gets generated today. I will investigate and possibly open an issue to track this.

Member Author

Updated the existing issue with my comments: dotnet/runtime#13727 (comment)

Member Author

Perhaps we can do something like this?

`private static void Consume(in System.Span<T> _) { }`

But yeah, that would have method call overhead too.
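
For reference, the no-op sink might be shaped like this (a hypothetical sketch; the `NoInlining` attribute and names are illustrative, not part of this PR). It keeps the argument observably "used" without the volatile write that `Consumer.Consume` performs, at the price of a real call per invocation:

```csharp
using System;
using System.Runtime.CompilerServices;

internal static class Sink
{
    // Empty, non-inlineable consumer: the JIT must keep the call (and therefore
    // its argument) alive, but no memory barrier is involved.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void Consume<T>(in Span<T> _) { }
}
```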

Member

In that case, I think I am going to merge it as it is.

If it becomes a problem in the future, we might consider removing this benchmark because it looks like it might be impossible to properly benchmark this method (it's too simple).

/cc @billwert @DrewScoggins

adamsitnik merged commit a6955b7 into dotnet:master on Apr 15, 2020
@billwert
Member

I agree with @adamsitnik that the value of this benchmark is likely to be very small. What's the real regression risk for this method? As @EgorBo points out, the method is pretty trivial as is. Is that ever likely to change? (I imagine not.)

I'm fine with the change as is, but if this becomes problematic in the future (or becomes impossible to filter out of our results if it is noisy), we should indeed just get rid of it.
