Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix HeadlessUnitTestSession creation race condition #12979

Merged
merged 4 commits into from
Sep 24, 2023

Conversation

MrJul
Copy link
Member

@MrJul MrJul commented Sep 21, 2023

What does the pull request do?

This PR fixes two race conditions related to the headless unit tests when run with xUnit, which resulted in hangs or exceptions.

How was the solution implemented?

Problem 1: HeadlessUnitTestSession (race condition)

HeadlessUnitTestSession could be incorrectly created multiple times for the same assembly.

ConcurrentDictionary.GetOrAdd shouldn't be used if the valueFactory has side effects since it can be called multiple times, as documented:

Since a key/value can be inserted by another thread while valueFactory is generating a value, you cannot trust that just because valueFactory executed, its produced value will be inserted into the dictionary and returned. If you call GetOrAdd simultaneously on different threads, valueFactory may be called multiple times, but only one key/value pair will be added to the dictionary.

This is very easy to forget (I've fallen into this trap in the past). When valueFactory can only be called once and concurrency throughput is important, one should ideally call the lock-free TryGetValue, then take a lock to insert the value if it's missing: example. Since we're talking about unit tests here, I've simply used an old-school Dictionary with a lock to simplify things: the lock isn't a bottleneck at all here.

I've also made sure that errors are always reported if there's any problem in HeadlessUnitTestSession instead of hanging without any output: EnsureApplication wasn't called inside the try block, effectively crashing the session silently without completing the current test. The Dispatch call is also done a bit later in the test pipeline so xUnit can associate the exception with the current test.

Problem 2: MediaContext (race condition)

After the fix for HeadlessUnitTestSession, we now have a single thread executing the unit tests, but still have some random race conditions.
As always with this type of issue, this was very hard to diagnose: on my machine, the problem occurred once every hundreds of runs, but it's much more frequent on the CI instances. I finally managed to have a stable enough repro.

The culprit is the renderer thread trying to post back to the UI thread:

Dispatcher.UIThread.Post(() => CompositionBatchFinished(compositor, commit), DispatcherPriority.Send));

When an headless test finishes executing, it resets the Dispatcher.UIThread to null. The next test can then get a fresh new dispatcher.
The race was the new unit test trying to create the UI thread while the MediaContext from a previous test was also creating a new one.

Depending on which UI thread won, this manifested either in an exception in Dispatcher.PushFrame or a hang in that same method, with the test never finishing (it's running on the wrong dispatcher) and thus never stopping the DispatcherFrame.

While it can be fixed by adding a lock to the UIThread getter, it's still wrong for a previous MediaContext to post to a new, unrelated, dispatcher.
Instead, both Compositor and MediaContext now capture the UI thread dispatcher when they're created, and use it directly.
Since the dispatcher is now properly shutdown between unit tests with this PR, the Post operation from the MediaContext will never run if the dispatcher was shutdown.

Those modifications are imperceptible for normal apps, where there's one and only UIThread dispatcher.

Problem 3: AvaloniaTestCase (deadlock)

Fixing the issue with multiple sessions made an existing potential deadlock more likely to happen in AvaloniaTestCase. In this class, the xUnit thread running the test is blocked to allow its concurrency limit to work, as commented. If the max concurrency is reached, we have a deadlock: there's no xUnit thread free to continue the work since are all blocked waiting.

This is solved by running the test itself in a task, while the xUnit threads are simply waiting.

Fixed issues

Remaining issue

HeadlessUnitTestSession is still broken for multiple assemblies: there should be an unique dispatch loop for all sessions.
Currently, multiple sessions will set and fight for their own UI thread in parallel, quickly crashing or hanging.

@MrJul MrJul added area-infrastructure Issues related to CI/tooling infrastructur area-headless labels Sep 21, 2023
@MrJul
Copy link
Member Author

MrJul commented Sep 21, 2023

aaaand it's a fail :( the CI will timeout on Avalonia.Headless.XUnit.UnitTests again.

The race definitely exists (I logged it locally) and the hangs are easily reproduced by creating one session per test... I'm out of ideas for now.

@MrJul MrJul marked this pull request as draft September 21, 2023 20:24
@maxkatz6 maxkatz6 self-requested a review September 21, 2023 20:51
@maxkatz6
Copy link
Member

I think something is wrong with this new "lock" as well, as it fails every single build on this PR now.

@MrJul
Copy link
Member Author

MrJul commented Sep 23, 2023

Definitely. At this point the build server is trolling me hard :(
After my second fix (the UIThread creation race), I simply can't reproduce it locally, I let the tests ran in loop over the night: over 20000 runs without failure or hanging. Yet it fails immediately on the CI pipeline.

@MrJul
Copy link
Member Author

MrJul commented Sep 23, 2023

Got it! Now easily reproduced with MaxParallelThreads = 2 which matches the build agent having only 2 cores.
It was a nice deadlock (see problem 3 in my edited issue description).

@MrJul MrJul marked this pull request as ready for review September 23, 2023 15:30
@avaloniaui-team
Copy link
Contributor

You can test this PR using the following package version. 11.0.999-cibuild0039867-beta. (feed url: https://nuget-feed-all.avaloniaui.net/v3/index.json) [PRBUILDID]

@kekekeks
Copy link
Member

Could we use simple Dispatcher naming for Compositor/MediaContext? It doesn't technically have to be a UI thread dispatcher, just dispatcher associated with a particular Compositor/MediaContext.

In general we are not opposed to having multiple dispatchers at some point later, multiple UI threads feature is being requested from time to time and it would be nice to run tests in parallel. Something like VisualTarget/HostVisual isn't out of the question too.

@MrJul
Copy link
Member Author

MrJul commented Sep 23, 2023

Sure thing, I just renamed the appropriate members.

Nice to know you're open of potentially having multiple dispatchers in the future! This will open new scenarios and will require less reliance on the global service locator in various implementations, which is a good thing imo.

@avaloniaui-team
Copy link
Contributor

You can test this PR using the following package version. 11.0.999-cibuild0039869-beta. (feed url: https://nuget-feed-all.avaloniaui.net/v3/index.json) [PRBUILDID]

@maxkatz6 maxkatz6 added this pull request to the merge queue Sep 24, 2023
Merged via the queue into AvaloniaUI:master with commit 9548bf7 Sep 24, 2023
5 checks passed
@MrJul MrJul deleted the fix/headless-xunit-race branch September 24, 2023 10:23
@MrJul MrJul added the backport-candidate-11.0.x Consider this PR for backporting to 11.0 branch label Sep 30, 2023
grokys pushed a commit that referenced this pull request Oct 2, 2023
* Fix HeadlessUnitTestSession creation race condition

* Bind Compositor/MediaContext to a fixed UI thread

* Fix dead lock in AvaloniaTestCase

* Rename Compositor.UIThreadDispatcher to Dispatcher
@grokys grokys added backported-11.0.x and removed backport-candidate-11.0.x Consider this PR for backporting to 11.0 branch labels Oct 14, 2023
@yankun
Copy link
Contributor

yankun commented Nov 6, 2023

Actually it seems that these changes made it worse for me. I already had trouble with 11.0.0 with this race condition and worked around it so my tests run stable but now they are breaking again with the exception mentioned in #12588.

I'm investigating what could be the cause of it.

@MrJul
Copy link
Member Author

MrJul commented Nov 6, 2023

@yankun

Actually it seems that these changes made it worse for me. I already had trouble with 11.0.0 with this race condition and worked around it so my tests run stable but now they are breaking again with the exception mentioned in #12588.

Please note that running multiple assemblies in parallel still isn't supported, as mentioned in this PR (there's no synchronization between the different sessions yet). If you can get some stable repro with a single assembly, that would be great! Don't hesitate to open a new issue with details.

(Note that I haven't seen any Avalonia build pipeline timeout / failure with PlatformNotSupportedException since this PR was merged. It was frequent before, so it definitely improved things here. Obviously every code base is different.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-headless area-infrastructure Issues related to CI/tooling infrastructur backported-11.0.x
Projects
None yet
Development

Successfully merging this pull request may close these issues.

XUnit package sometimes breaks tests with an System.PlatformNotSupportedException
6 participants