JIT: scalable profile counter mode #84427
Conversation
Add a config option to use a "scalable" profile helper for edge counters, where we try to avoid contention via probabilistic updates once the counter value exceeds some threshold (currently 8192). Using the current xorshift RNG this gives the counter a two-sigma accuracy of around +/- 2%. The idea is loosely based on "Scalable Statistics Counters" by Dice, Lev, and Moir (SPAA'13). Also allow the scalable and interlocked profile modes to operate at the same time, recording two sets of counts per probe, so we can verify that this new mode is sufficiently accurate.
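As a rough illustration of the scheme described above (a sketch, not the actual runtime helper; the xorshift constants and the exact way the probability is derived from the count are assumptions based on this description, and thread-safety is ignored):

```c
#include <stdint.h>

// Sketch of a scalable profile counter update. Below the 2^13 = 8192
// threshold the count is exact; above it, a count in [2^n, 2^(n+1)) is
// bumped by 2^(n-12) with probability 2^-(n-12), so the expected
// increment stays 1 while writes (and hence contention) become rare.
// The real helper would have to deal with concurrent updates.

static uint32_t xorshift32(uint32_t* state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

static void scalable_count(uint32_t* pCounter, uint32_t* rngState)
{
    uint32_t count = *pCounter;

    int logCount = 0;                    // floor(log2(count)) for count >= 1
    while ((count >> logCount) > 1)
        logCount++;

    if (logCount < 13)
    {
        *pCounter = count + 1;           // small counts: exact increment
        return;
    }

    int delta = logCount - 12;           // log2 of increment == -log2 of probability
    if ((xorshift32(rngState) & ((1u << delta) - 1)) == 0)
        *pCounter = count + (1u << delta);
}
```

Counting to a million this way typically lands within a couple of percent of the true value, matching the two-sigma figure quoted above.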
@EgorBo PTAL I am enabling by default here -- we could reconsider and just merge this disabled, but local testing looks quite promising. Will cause significant diffs in instrumented code.

Nice!
    unsigned int logCount = 0;
    BitScanReverse(&logCount, count);

    if (logCount >= 13)
Can you add a comment about 13? Probably just point to the paper "Scalable Statistics Counters" by Dice, Lev, and Moir (SPAA'13), at least?
That paper won't help you understand why this value is 13, but the design note I added will.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
CI is not triggering the right tests. Going to try bouncing this.

Two questions:

Probably need to update something for NAOT since I added new JIT helpers in the middle of the enum space.

Note SPMI can't help us here because I changed the JIT GUID. We will have to generate diffs/TP impact retrospectively by disabling this in a trial PR once it's in.

The counters in class probes no longer count very high -- are you suggesting that they could? I don't know if we really need the 64-bit capabilities for dynamic PGO, but I'd like to keep it viable in case we do need it sometime. Also I think we enable it in the optimization repo just in case we might overflow counts, since we force everything to stay in tier0.

There are no artifacts for the arm linux musl failure, so build analysis can't determine if it is a known issue. But the failure is in

Going to run a few more tests locally before I merge, and pgo stress. Note some of those optional pipelines have known failures.

/azp run runtime-coreclr pgo, runtime-coreclr pgostress, runtime-coreclr libraries-pgo

Azure Pipelines successfully started running 3 pipeline(s).
    {
        int logCount = 31 - (int) uint.LeadingZeroCount(count);

        if (logCount >= 13)
Can we have a sort of fast path here? E.g.

    if (count & 0x3FFFF) // is counter already large enough?
    {
        int logCount = 31 - (int) uint.LeadingZeroCount(count);
        ..
    }

E.g. LZCNT is not cheap on Arm since there is no scalar version.
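One shape such a fast path might take (a sketch, not the actual change; the explicit `< 8192` comparison and the helper name stand in for whatever test ends up cheapest):

```c
#include <stdint.h>

// Sketch of the suggested fast path: counts below the 2^13 threshold take an
// unconditional increment and never execute the leading-zero count, which
// matters on targets where a scalar CLZ/LZCNT is relatively expensive.
static void count_with_fast_path(uint32_t* pCounter, uint32_t (*rand32)(void))
{
    uint32_t count = *pCounter;

    if (count < (1u << 13))              // fast path: exact counting, no bit scan
    {
        *pCounter = count + 1;
        return;
    }

    int logCount = 0;                    // slow path: floor(log2(count))
    while ((count >> logCount) > 1)
        logCount++;

    int delta = logCount - 12;           // increment 2^delta with probability 2^-delta
    if ((rand32() & ((1u << delta) - 1)) == 0)
        *pCounter = count + (1u << delta);
}
```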
Moreover, we can probably inline it in JIT codegen if needed
You confused me for a second commenting on the version in the doc and not the code in the runtime itself.
Yes we can do some sort of check like this.
Reviewing optional pipeline failures, they are known issues or also happen in the baseline run.
@@ -5736,6 +5736,65 @@ HCIMPL3(void, JIT_VTableProfile64, Object* obj, CORINFO_METHOD_HANDLE baseMethod
    }
    HCIMPLEND

    HCIMPL1(void, JIT_CountProfile32, volatile LONG* pCounter)
It would be useful to have comments here, if nothing else, to link to the design paper.
This sort of counter seems well-suited for use in our Dynamic PGO instrumentation.

It may be that approximate counting will be useful in other application areas where scala
Sentence got cut off here - presumably you just mean to say where scalability is needed but small errors are acceptable
yep, thanks.
As we count higher the standard deviation is limited by $\sigma \approx \sqrt{NP}$, so when we double $N$ and halve $P$ the standard deviation $\sigma$ remains roughly the same overall.
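To spell out that scaling claim: writing $S$ for the number of successful updates out of $N$ attempts at probability $P$,

$$
\sigma_S = \sqrt{N P (1 - P)} \le \sqrt{NP}, \qquad (2N) \cdot \frac{P}{2} = NP,
$$

so doubling the number of attempts while halving the update probability leaves the bound $\sqrt{NP}$ unchanged.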
If (via the benchmark) we look at how tunable the scalability is, we see that the higher the threshold for switching to probabilistic counting, the higher the cost (but of course the better the accuracy): |
Discussion of the threshold (both in your brief talk and here) has been the most confusing point for me. Essentially, why does switching earlier (say 512 vs 8192) cause counts 1-2 million to be so different?
The answer appears to be that the cutoff is associated with the beginning of the *scaling* of N, which is a byproduct of using a single tuning variable. If one used N=2 for 512-16k and then started increasing N, then the graph would differ between 512 and 8k (with the error being the worst in that interval at 512) and then match afterwards (aside from any cumulative differences from 512-8k, though those would eventually become insignificant).
I don't know if you want to change the descriptions at all for this though. Maybe it only confused me.
I guess it does suggest that if profiling goals dictate, the start/scaling could be adjusted separately to meet those goals.
Right, there is considerable flexibility in deciding when to change the increment/probability and how much to change it by and the change points don't need to be powers of two or spaced in any orderly way.
The paper I referenced describes a mode where once the increments are large enough and the probability of updates is low enough, they just plateau there, assuming the overhead is by then so insignificant that further decreases in the update probability won't matter much, so counting ends up being relatively more accurate for very large counts than for smaller counts.
So if we start probabilistically incrementing by $2$ with probability $1/2$ at $8192$, then after $8192$ probabilistic updates we have added an expected value of $8192 \cdot 2 \cdot 1/2 = 8192$ to the counter.
The standard deviation of the actual number of updates is $\sqrt{2^{13} \cdot 1/2 \cdot (1-1/2)} = \sqrt{2^{11}} \approx 45$. Each update is by 2, so the two standard deviation expected range for the change in the counter value is $2 \cdot 2 \cdot 45 \approx 180$. The relative error range is thus $\pm 180 / 8192 \approx \pm 0.022$. This is in reasonable agreement with the empirical study above.
I think there's an interesting thing (coincidence?) going on here. The empirical study is going to include the effects of perfect measurement from 0-8192, so the reported empirical value for 16k is going to be half this relative error, or 0.011. This matches the graph very well.
This calculation for each additional section quickly goes to 0.03-0.031. However, again, the measured cumulative error is going to be smaller. It turns out to be close to that initial calculation.
One thing I can't figure out is the graph where you vary the starting point. If I do the calculation for 10, I get 0.0625 (and even halving it for the cumulative effect I get ~0.03). However, the first data point off the center line is 1.005. (Or maybe I should be looking at the next one, which is close to 1.03? The x-axis seems confusing here because that "first" point appears to be over 1k when it should be over 2k since 1k is the crossover point. Maybe? And I guess that I need to point out that the non-powers-of-2 on the x-axis are a little worrying :))
Ah, good catch. I think there's a bug in my simulation code, the log should be
int logCount = 32 - (int) uint.LeadingZeroCount(count);
So the simulation starts probabilistic mode sooner than it should. Let me rerun this and see if the graph makes more sense.
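For reference, the two expressions differ by exactly one: `31 - LZCNT(x)` is $\lfloor \log_2 x \rfloor$, while `32 - LZCNT(x)` is the bit length of `x`, so the two variants cross the `>= 13` threshold a factor of two apart in counter value. A small C sketch (using the GCC/Clang builtin purely for illustration):

```c
#include <stdint.h>

// For x > 0 (the builtin is undefined for 0):
//   31 - clz(x) == floor(log2(x))    -- crosses 13 at x == 8192
//   32 - clz(x) == bit length of x   -- crosses 13 at x == 4096
static int floor_log2(uint32_t x) { return 31 - __builtin_clz(x); }
static int bit_length(uint32_t x) { return 32 - __builtin_clz(x); }
```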
No, that's not the issue, the threshold is OK. When I added a parallel mode to the simulation, I wanted to make sure the counting divided evenly, so I rounded up the count to the next multiple of the number of CPUs. That skews the data a bit as you noted, so yes, you should look at "the next one".
Here is a revised plot without the extra counts:
That chart matches what I found. It's fascinating how such a seemingly simple idea works out so well.
Artifacts are not showing up for test failures, so presumably build analysis can't function either. The failure is almost certainly unrelated.

@MichalStrehovsky looks like you got auto-assigned as reviewer. I'm going to merge and pick up any feedback from you later.

Seems like this caused a regression in at least one microbenchmark with Dynamic PGO [link]