
JIT: scalable profile counter mode #84427

Merged: 8 commits into dotnet:main, Apr 9, 2023

Conversation

AndyAyersMS (Member)

Add a config option to use a "scalable" profile helper for edge counters, where we try to avoid contention via probabilistic updates once the counter value exceeds some threshold (currently 8192). Using the current xorshift RNG this gives the counter a two-sigma accuracy of around +/- 2%.

The idea is loosely based on "Scalable Statistics Counters" by Dice, Lev, and Moir (SPAA’13).

Also allow the scalable and interlocked profile modes to operate at the same time, recording two sets of counts per probe, so we can verify that this new mode is sufficiently accurate.

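A rough, self-contained C# sketch of the scheme described above, for orientation: exact interlocked counting below the 8192 threshold, then increments of 2^d applied with probability 1/2^d via a xorshift RNG so the expected increment per call stays 1. Names and constants here are illustrative, not the actual helper code (the PR adds C++ VM helpers such as JIT_CountProfile32, shown further down in this thread):

using System.Threading;

static class ScalableCounterSketch
{
    // Marsaglia xorshift32; the runtime uses a xorshift-style RNG as well, though
    // not necessarily with these exact constants. State must be seeded nonzero.
    static uint XorShift(ref uint state)
    {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        return state;
    }

    public static void Count(ref uint counter, ref uint rngState)
    {
        uint count = counter;
        int logCount = 31 - (int)uint.LeadingZeroCount(count); // floor(log2(count)), -1 for 0

        if (logCount < 13)
        {
            // Below the 8192 (2^13) threshold: exact interlocked counting.
            Interlocked.Increment(ref counter);
            return;
        }

        // At or above the threshold: for count in [2^13, 2^14) add 2 with probability
        // 1/2, for [2^14, 2^15) add 4 with probability 1/4, and so on. The expected
        // increment per call stays 1, but most calls skip the contended write.
        int delta = logCount - 12;
        uint mask = (1u << delta) - 1;

        if ((XorShift(ref rngState) & mask) == 0)
        {
            Interlocked.Add(ref counter, 1u << delta);
        }
    }
}

Each thread keeps its own nonzero rngState; once the counter is hot, most calls skip the contended interlocked write entirely, which is where the scalability comes from.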
ghost assigned AndyAyersMS Apr 6, 2023
dotnet-issue-labeler bot added the needs-area-label label (An area label is needed to ensure this gets routed to the appropriate area owners) Apr 6, 2023
AndyAyersMS (Member Author)

@EgorBo PTAL
cc @dotnet/jit-contrib

I am enabling by default here -- we could reconsider and just merge this disabled, but local testing looks quite promising.

Will cause significant diffs in instrumented code.

EgorBo (Member) commented Apr 6, 2023

Nice!

unsigned int logCount = 0;
BitScanReverse(&logCount, count);

if (logCount >= 13)
Member:

Can you add a comment about 13? Probably just point to the paper "Scalable Statistics Counters" by Dice, Lev, and Moir (SPAA’13), at least?

AndyAyersMS (Member Author):

That paper won't help you understand why this value is 13, but the design note I added will.
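For context on the constant: BitScanReverse in the helper, like 31 - lzcnt in the C# simulation, computes floor(log2(count)), so the 13 is just the log2 of the 8192 threshold. Illustrative check (not the PR's code):

uint count = 8192;
int logCount = 31 - (int)uint.LeadingZeroCount(count); // floor(log2(8192)) == 13
// "logCount >= 13" is therefore equivalent to "count >= 2^13 == 8192",
// the point where counting switches from exact to probabilistic.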

AndyAyersMS added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Apr 6, 2023
ghost commented Apr 6, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.


AndyAyersMS removed the needs-area-label label Apr 6, 2023
AndyAyersMS (Member Author)

CI is not triggering the right tests. Going to try bouncing this.

AndyAyersMS closed this Apr 6, 2023
AndyAyersMS reopened this Apr 6, 2023
EgorBo (Member) commented Apr 6, 2023

Two questions:

  1. Do you plan to do the same with class probes?
  2. Do we need 64-bit probes? Aren't 32-bit counters enough with your algorithm?

AndyAyersMS (Member Author)

Probably need to update something for NAOT since I added new jit helpers in the middle of the enum space.

AndyAyersMS (Member Author)

Note SPMI can't help us here because I changed the JIT GUID. We will have to generate diffs/tp impact retrospectively by disabling this in a trial PR once it's in.

AndyAyersMS (Member Author)

Two questions:

  1. Do you plan to do the same with class probes?
  2. Do we need 64-bit probes? Aren't 32-bit counters enough with your algorithm?

The counters in class probes no longer count very high -- are you suggesting that they could?

I don't know if we really need the 64-bit capabilities for dynamic PGO, but I'd like to keep it viable in case we do need it sometime. Also I think we enable it in the optimization repo just in case we might overflow counts, since we force everything to stay in tier0.

AndyAyersMS (Member Author)

There are no artifacts for the arm linux musl failure, so build analysis can't determine if it is a known issue. But the failure is in System.Threading.Tasks.Dataflow.Tests so it looks like it could be #80857.

AndyAyersMS (Member Author) commented Apr 7, 2023

Going to run a few more tests locally before I merge, plus pgo stress. Note that some of those optional pipelines have known failures.

AndyAyersMS (Member Author)

/azp run runtime-coreclr pgo, runtime-coreclr pgostress, runtime-coreclr libraries-pgo

azure-pipelines
Azure Pipelines successfully started running 3 pipeline(s).

{
int logCount = 31 - (int) uint.LeadingZeroCount(count);

if (logCount >= 13)
EgorBo (Member) commented Apr 7, 2023:

can we have a sort of fast path here? E.g.

if (count & 0x3FFFF) // is counter already large enough?
{
    int logCount = 31 - (int) uint.LeadingZeroCount(count);
    ..
}

e.g. LZC is not cheap on arm since there is no scalar version

Member:

Moreover, we can probably inline it in JIT codegen if needed

AndyAyersMS (Member Author):

You confused me for a second by commenting on the version in the doc rather than the code in the runtime itself.

Yes we can do some sort of check like this.
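For example, something along these lines (a sketch only, reusing the XorShift helper and probabilistic path from the sketch earlier in this thread; the guard that actually landed may differ):

public static void Count(ref uint counter, ref uint rngState)
{
    uint count = counter;

    // Fast path: below the 8192 threshold there is nothing probabilistic to do,
    // so take the plain interlocked increment without computing lzcnt at all.
    if (count < 8192)
    {
        Interlocked.Increment(ref counter);
        return;
    }

    // Slow path: probabilistic update as before; lzcnt only runs for hot counters.
    int logCount = 31 - (int)uint.LeadingZeroCount(count);
    int delta = logCount - 12;
    uint mask = (1u << delta) - 1;

    if ((XorShift(ref rngState) & mask) == 0)
    {
        Interlocked.Add(ref counter, 1u << delta);
    }
}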

AndyAyersMS (Member Author)

Reviewing optional pipeline failures, they are known issues or also happen in the baseline run.

@@ -5736,6 +5736,65 @@ HCIMPL3(void, JIT_VTableProfile64, Object* obj, CORINFO_METHOD_HANDLE baseMethod
}
HCIMPLEND

HCIMPL1(void, JIT_CountProfile32, volatile LONG* pCounter)
Member:

It would be useful to have comments here, if nothing else, to link to the design paper.


This sort of counter seems well-suited for use in our Dynamic PGO instrumentation.

It may be that approximate counting will be useful in other application areas where scala
Member:

Sentence got cut off here - presumably you just mean to say where scalability is needed but small errors are acceptable

AndyAyersMS (Member Author):

yep, thanks.


As we count higher the standard deviation is limited by $\sigma \approx \sqrt{NP}$, so when we double $N$ and halve $P$ the standard deviation $\sigma$ remains roughly the same overall.

If (via the benchmark) we look at how tunable the scalability is, we see that the higher the threshold for switching to probabilistic counting, the higher the cost (but of course the better the accuracy):
markples (Member) commented Apr 7, 2023:

Discussion of the threshold (both in your brief talk and here) has been the most confusing point for me. Essentially, why does switching earlier (say 512 vs 8192) cause counts 1-2 million to be so different?

The answer appears to be that the cutoff is associated with the beginning of the scaling of N, which is a byproduct of using a single tuning variable. If one used N=2 for 512-16k and then started increasing N, then the graph would differ between 512 and 8k (with the error being worst in that interval at 512) and then match afterwards (aside from any cumulative differences from 512-8k, though those would eventually become insignificant).

I don't know if you want to change the descriptions at all for this though. Maybe it only confused me.

I guess it does suggest that if profiling goals dictate, the start/scaling could be adjusted separately to meet those goals.

AndyAyersMS (Member Author):

Right, there is considerable flexibility in deciding when to change the increment/probability and how much to change it by, and the change points don't need to be powers of two or spaced in any orderly way.

The paper I referenced describes a mode where, once the increments are large enough and the probability of updates is low enough, they just plateau there, on the assumption that the overhead is by then so insignificant that further decreases in the update probability won't matter much; counting therefore ends up being relatively more accurate for very large counts than for smaller counts.


So if we start probabilistically incrementing by $2$ with probability $1/2$ at $8192$, then after $8192$ probabilistic updates we have added an expected value of $8192 \cdot 2 \cdot 1/2 = 8192$ to the counter.

The standard deviation in the actual number of updates is $\sqrt{2^{13} \cdot 1/2 \cdot (1-1/2)} = \sqrt{2^{11}} \approx 45$. Each update is by 2, so the two-standard-deviation expected range for the change in the counter value is $2 \cdot 2 \cdot 45 \approx 180$. The relative error range is thus $\pm 180 / 8192 \approx \pm 0.022$. This is in reasonable agreement with the empirical study above.
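As a quick sanity check on that arithmetic, a throwaway simulation (not part of the change) of just the probabilistic phase reproduces the same spread:

using System;

var rng = new Random(42);
const int trials = 100_000;
double sum = 0, sumSq = 0;

for (int t = 0; t < trials; t++)
{
    // Start at the 8192 threshold and perform 8192 attempted probabilistic updates,
    // each adding 2 with probability 1/2.
    long counter = 8192;
    for (int i = 0; i < 8192; i++)
        if ((rng.Next() & 1) == 0)
            counter += 2;
    sum += counter;
    sumSq += (double)counter * counter;
}

double mean = sum / trials;                          // ~16384
double sd = Math.Sqrt(sumSq / trials - mean * mean); // ~2 * 45 ≈ 90
Console.WriteLine($"mean={mean:F0} sd={sd:F1} two-sigma={2 * sd:F0}");
// two-sigma ≈ 180, i.e. about ±0.022 relative to the 8192 added probabilistically,
// or about ±0.011 relative to the full count of 16384.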
Member:

I think there's an interesting thing (coincidence?) going on here. The empirical study is going to include the effects of perfect measurement from 0-8192, so the reported empirical value for 16k is going to be half this relative error, or 0.011. This matches the graph very well.

This calculation for each additional section quickly goes to 0.03-0.031. However, again, the measured cumulative error is going to be smaller. It turns out to be close to that initial calculation.

One thing I can't figure out is the graph where you vary the starting point. If I do the calculation for 10, I get 0.0625 (and even halving it for the cumulative effect I get ~0.03). However, the first data point off the center line is 1.005. (Or maybe I should be looking at the next one, which is close to 1.03? The x-axis seems confusing here because that "first" point appears to be over 1k when it should be over 2k since 1k is the crossover point. Maybe? And I guess that I need to point out that the non-powers-of-2 on the x-axis are a little worrying :))

AndyAyersMS (Member Author):

Ah, good catch. I think there's a bug in my simulation code; the log should be

   int logCount = 32 - (int) uint.LeadingZeroCount(count);

So the simulation starts probabilistic mode sooner than it should. Let me rerun this and see if the graph makes more sense.

AndyAyersMS (Member Author):

No, that's not the issue; the threshold is OK. When I added a parallel mode to the simulation, I wanted to make sure the counting divided evenly across threads, so I rounded up the count to the next multiple of the number of CPUs. That skews the data a bit, as you noted, so yes, you should look at "the next one".

Here is a revised plot without the extra counts:

[revised simulation error plot]

Member:

That chart matches what I found. It's fascinating how such a seemingly simple idea works out so well.

AndyAyersMS (Member Author)

Artifacts are not showing up for test failures, so presumably build analysis can't function either.

The failure is almost certainly unrelated.

AndyAyersMS (Member Author)

@MichalStrehovsky looks like you got auto-assigned as reviewer. I'm going to merge and pick up any feedback from you later.

AndyAyersMS merged commit e641efb into dotnet:main Apr 9, 2023
AndyAyersMS (Member Author) commented Apr 10, 2023

Impact on linux x64 platform plaintext -- blue is tier0 instrumented, orange is tier0 un-instrumented

[throughput chart]

Impact on linux ampere arm64 platform plaintext -- blue is tier0 instrumented, orange is tier0 un-instrumented

[throughput chart]

AndyAyersMS (Member Author) commented Apr 24, 2023

Seems like this caused a regression in at least one microbenchmark with Dynamic PGO [link]

[benchmark regression chart]
