JIT: scalable profile counter mode #84427
Conversation
Add a config option to use a "scalable" profile helper for edge counters, where we try to avoid contention via probabilistic updates once the counter value exceeds some threshold (currently 8192). Using the current xorshift RNG this gives the counter a two-sigma accuracy of around +/- 2%. The idea is loosely based on "Scalable Statistics Counters" by Dice, Lev, and Moir (SPAA'13). Also allow the scalable and interlocked profile modes to operate at the same time, recording two sets of counts per probe, so we can verify that this new mode is sufficiently accurate.
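As a rough illustration of the scheme described above (a sketch, not the actual runtime helper; the xorshift constants and the exact way the probability is derived from the count are assumptions based on this description, and thread-safety is ignored):

```c
#include <stdint.h>

// Sketch of a scalable profile counter update. Below the 2^13 = 8192
// threshold the count is exact; above it, a count in [2^n, 2^(n+1)) is
// bumped by 2^(n-12) with probability 2^-(n-12), so the expected
// increment stays 1 while writes (and hence contention) become rare.
// The real helper would have to deal with concurrent updates.

static uint32_t xorshift32(uint32_t* state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

static void scalable_count(uint32_t* pCounter, uint32_t* rngState)
{
    uint32_t count = *pCounter;

    int logCount = 0;                    // floor(log2(count)) for count >= 1
    while ((count >> logCount) > 1)
        logCount++;

    if (logCount < 13)
    {
        *pCounter = count + 1;           // small counts: exact increment
        return;
    }

    int delta = logCount - 12;           // log2 of increment == -log2 of probability
    if ((xorshift32(rngState) & ((1u << delta) - 1)) == 0)
        *pCounter = count + (1u << delta);
}
```

Counting to a million this way typically lands within a couple of percent of the true value, matching the two-sigma figure quoted above.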
@EgorBo PTAL I am enabling by default here -- we could reconsider and just merge this disabled, but local testing looks quite promising. Will cause significant diffs in instrumented code.

Nice!
    unsigned int logCount = 0;
    BitScanReverse(&logCount, count);

    if (logCount >= 13)
Can you add a comment about 13? Probably just point to the paper "Scalable Statistics Counters" by Dice, Lev, and Moir (SPAA'13), at least?
That paper won't help you understand why this value is 13, but the design note I added will.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
CI is not triggering the right tests. Going to try bouncing this.

Two questions:

Probably need to update something for NAOT since I added new JIT helpers in the middle of the enum space.

Note SPMI can't help us here because I changed the JIT GUID. We will have to generate diffs/TP impact retrospectively by disabling this in a trial PR once it's in.

The counters in class probes no longer count very high -- are you suggesting that they could? I don't know if we really need the 64-bit capabilities for dynamic PGO, but I'd like to keep it viable in case we do need it sometime. Also I think we enable it in the optimization repo just in case we might overflow counts, since we force everything to stay in tier0.

There are no artifacts for the arm linux musl failure, so build analysis can't determine if it is a known issue. But the failure is in

Going to run a few more tests locally before I merge, and pgo stress. Note some of those optional pipelines have known failures.

/azp run runtime-coreclr pgo, runtime-coreclr pgostress, runtime-coreclr libraries-pgo

Azure Pipelines successfully started running 3 pipeline(s).
    {
        int logCount = 31 - (int) uint.LeadingZeroCount(count);

        if (logCount >= 13)
Can we have a sort of fast path here? E.g.

    if (count & 0x3FFFF) // is counter already large enough?
    {
        int logCount = 31 - (int) uint.LeadingZeroCount(count);
        ..
    }

E.g. LZCNT is not cheap on Arm since there is no scalar version.
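One shape such a fast path might take (a sketch, not the actual change; the explicit `< 8192` comparison and the helper name stand in for whatever test ends up cheapest):

```c
#include <stdint.h>

// Sketch of the suggested fast path: counts below the 2^13 threshold take an
// unconditional increment and never execute the leading-zero count, which
// matters on targets where a scalar CLZ/LZCNT is relatively expensive.
static void count_with_fast_path(uint32_t* pCounter, uint32_t (*rand32)(void))
{
    uint32_t count = *pCounter;

    if (count < (1u << 13))              // fast path: exact counting, no bit scan
    {
        *pCounter = count + 1;
        return;
    }

    int logCount = 0;                    // slow path: floor(log2(count))
    while ((count >> logCount) > 1)
        logCount++;

    int delta = logCount - 12;           // increment 2^delta with probability 2^-delta
    if ((rand32() & ((1u << delta) - 1)) == 0)
        *pCounter = count + (1u << delta);
}
```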
Moreover, we can probably inline it in JIT codegen if needed
You confused me for a second commenting on the version in the doc and not the code in the runtime itself.
Yes we can do some sort of check like this.
Reviewing optional pipeline failures, they are known issues or also happen in the baseline run.
@@ -5736,6 +5736,65 @@ HCIMPL3(void, JIT_VTableProfile64, Object* obj, CORINFO_METHOD_HANDLE baseMethod
    }
    HCIMPLEND

    HCIMPL1(void, JIT_CountProfile32, volatile LONG* pCounter)
It would be useful to have comments here, if nothing else, to link to the design paper.
This sort of counter seems well-suited for use in our Dynamic PGO instrumentation.

It may be that approximate counting will be useful in other application areas where scala
Sentence got cut off here - presumably you just mean to say where scalability is needed but small errors are acceptable
yep, thanks.
As we count higher the standard deviation is limited by $\sigma \approx \sqrt{NP}$, so when we double $N$ and halve $P$ the standard deviation $\sigma$ remains roughly the same overall.
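To spell out that scaling claim: writing $S$ for the number of successful updates out of $N$ attempts at probability $P$,

$$
\sigma_S = \sqrt{N P (1 - P)} \le \sqrt{NP}, \qquad (2N) \cdot \frac{P}{2} = NP,
$$

so doubling the number of attempts while halving the update probability leaves the bound $\sqrt{NP}$ unchanged.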
If (via the benchmark) we look at how tunable the scalability is, we see that the higher the threshold for switching to probabilistic counting, the higher the cost (but of course the better the accuracy): |
Discussion of the threshold (both in your brief talk and here) has been the most confusing point for me. Essentially, why does switching earlier (say 512 vs 8192) cause counts 1-2 million to be so different?
The answer appears to be that the cutoff is associated with the beginning of the *scaling* of N, which is a byproduct of using a single tuning variable. If one used N=2 for 512-16k and then started increasing N, then the graph would differ between 512 and 8k (with the error being the worst in that interval at 512) and then match afterwards (aside from any cumulative differences from 512-8k, though those would eventually become insignificant).
I don't know if you want to change the descriptions at all for this though. Maybe it only confused me.
I guess it does suggest that if profiling goals dictate, the start/scaling could be adjusted separately to meet those goals.
Right, there is considerable flexibility in deciding when to change the increment/probability and how much to change it by and the change points don't need to be powers of two or spaced in any orderly way.
The paper I referenced describes a mode where once the increments are large enough and the probability of updates is low enough, they just plateau there, assuming the overhead is by then so insignificant that further decreases in the update probability won't matter much, so counting ends up being relatively more accurate for very large counts than for smaller counts.
So if we start probabilistically incrementing by $2$ with probability $1/2$ at $8192$, then after $8192$ probabilistic updates we have added an expected value of $8192 \cdot 2 \cdot 1/2 = 8192$ to the counter.
The standard deviation of the actual number of updates is $\sqrt{2^{13} \cdot 1/2 \cdot (1-1/2)} = \sqrt{2^{11}} \approx 45$. Each update is by 2, so the two standard deviation expected range for the change in the counter value is $2 \cdot 2 \cdot 45 \approx 180$. The relative error range is thus $\pm 180 / 8192 \approx \pm 0.022$. This is in reasonable agreement with the empirical study above.
I think there's an interesting thing (coincidence?) going on here. The empirical study is going to include the effects of perfect measurement from 0-8192, so the reported empirical value for 16k is going to be half this relative error, or 0.011. This matches the graph very well.
This calculation for each additional section quickly goes to 0.03-0.031. However, again, the measured cumulative error is going to be smaller. It turns out to be close to that initial calculation.
One thing I can't figure out is the graph where you vary the starting point. If I do the calculation for 10, I get 0.0625 (and even halving it for the cumulative effect I get ~0.03). However, the first data point off the center line is 1.005. (Or maybe I should be looking at the next one, which is close to 1.03? The x-axis seems confusing here because that "first" point appears to be over 1k when it should be over 2k since 1k is the crossover point. Maybe? And I guess that I need to point out that the non-powers-of-2 on the x-axis are a little worrying :))
Ah, good catch. I think there's a bug in my simulation code, the log should be
int logCount = 32 - (int) uint.LeadingZeroCount(count);
So the simulation starts probabilistic mode sooner than it should. Let me rerun this and see if the graph makes more sense.
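For reference, the two expressions differ by exactly one: `31 - LZCNT(x)` is $\lfloor \log_2 x \rfloor$, while `32 - LZCNT(x)` is the bit length of `x`, so the two variants cross the `>= 13` threshold a factor of two apart in counter value. A small C sketch (using the GCC/Clang builtin purely for illustration):

```c
#include <stdint.h>

// For x > 0 (the builtin is undefined for 0):
//   31 - clz(x) == floor(log2(x))    -- crosses 13 at x == 8192
//   32 - clz(x) == bit length of x   -- crosses 13 at x == 4096
static int floor_log2(uint32_t x) { return 31 - __builtin_clz(x); }
static int bit_length(uint32_t x) { return 32 - __builtin_clz(x); }
```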
No, that's not the issue, the threshold is OK. When I added a parallel mode to the simulation, I wanted to make sure the counting divided evenly, so I rounded up the count to the next multiple of the number of CPUs. That skews the data a bit as you noted, so yes, you should look at "the next one".
Here is a revised plot without the extra counts:
That chart matches what I found. It's fascinating how such a seemingly simple idea works out so well.
Artifacts are not showing up for test failures, so presumably build analysis can't function either. The failure is almost certainly unrelated.

@MichalStrehovsky looks like you got auto-assigned as reviewer. I'm going to merge and pick up any feedback from you later.

Seems like this caused a regression in at least one microbenchmark with Dynamic PGO [link]