[release/8.0-staging] Fixes deadlock for IncrementingPollingCounter callbacks #108648
Modified backport of #105548 to release/8.0-staging
/cc @noahfalk @eterekhin
Customer Impact
The servicing request comes from the Microsoft Exchange team via internal email. This bug causes their service to occasionally hang at startup when a monitoring tool has enabled listening to the System.Runtime EventCounters. Variants of this bug have already been reported by multiple external customers, for example #93175.
The underlying issue is a deadlock caused by a lock ordering issue between the static constructor lock and the EventListener lock. It is fixed by changing the thread we issue the IncrementingPollingCounter callback on so that the EventListener lock isn't held when the callback runs.
Regression
To the best of my understanding, this bug has been present since the counters were first introduced in .NET Core 3. However, it's possible that specific details have shifted over time, allowing the bug to be hit more easily.
Testing
I tested manually in a debugger, stepping through all the modified code and verifying the expected behavior.
Risk
Low - I have guarded all the changed behavior with an opt-in AppContext switch (System.Diagnostics.Tracing.CounterCallbackOnTimerThread) and verified in the debugger that the switch operates as expected. The code change is also relatively isolated and has gotten some testing in our 9.0 development branches.
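For reference, a minimal sketch of how an app might opt in, assuming the switch is read through the standard AppContext mechanism and can therefore be set from `runtimeconfig.template.json` like other AppContext switches. Only the switch name comes from this PR; the file placement and the value `true` are assumptions for illustration.

```json
{
  "configProperties": {
    "System.Diagnostics.Tracing.CounterCallbackOnTimerThread": true
  }
}
```

Under the same assumption, calling `AppContext.SetSwitch` with this name early in startup, before any EventCounters are enabled, should have an equivalent effect.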
More details about the code change
This is a modified backport of #105548. It mostly preserves the logic of the original fix in .NET 9 with a few adjustments: