
ThreadPoolWorkQueue..ctor randomly will never return #104322

Closed
colejohnson66 opened this issue Jul 2, 2024 · 11 comments

Comments

colejohnson66 commented Jul 2, 2024

Description

Periodically, I've experienced a deadlock on program start inside ThreadPoolWorkQueue..ctor. According to the debugger, that constructor is stuck at RefreshLoggingEnabled and never continues. dotnet-dump provided the following information, implying it's stuck in a "dynamic helper frame":

Loading core dump: .\dump_20240702_143359.dmp ...
Ready to process analysis commands. Type 'help' to list available commands or 'help [command]' to get detailed help on a command.
Type 'quit' or 'exit' to exit the session.
> threads                                                                                                                                                                                                                   
*0 0x0BD4 (3028)
 1 0x6600 (26112)
 2 0x7554 (30036)
 3 0x6B30 (27440)
 4 0x283C (10300)
 5 0x7D70 (32112)
> clrstack -all                                                                                                                                                                                                             
OS Thread Id: 0xbd4
        Child SP               IP Call Site
000000D131D7C148 00007ff913170af4 [HelperMethodFrame_1OBJ: 000000d131d7c148] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000D131D7C2A0 00007FF85A9F791A System.Diagnostics.Tracing.EventSource.Initialize(System.Guid, System.String, System.String[]) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 1619]
000000D131D7C350 00007FF85AA04D39 System.Diagnostics.Tracing.FrameworkEventSource..ctor() [/_/src/coreclr/System.Private.CoreLib/System.Private.CoreLib.Generators/Generators.EventSourceGenerator/FrameworkEventSource.g.cs @ 12]
000000D131D7C3A0 00007FF85AA04D87 System.Diagnostics.Tracing.FrameworkEventSource..cctor() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/FrameworkEventSource.cs @ 14]
000000D131D7D530 00007ff87532b9e3 [DynamicHelperFrame: 000000d131d7d530]
000000D131D7D690 00007FF85A8F2883 System.Threading.TimerQueueTimer.Change(UInt32, UInt32) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Timer.cs @ 559]
000000D131D7D720 00007FF85A8F2776 System.Threading.TimerQueueTimer..ctor(System.Threading.TimerCallback, System.Object, UInt32, UInt32, Boolean) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Timer.cs @ 516]
000000D131D7D770 00007FF85A8F3451 System.Threading.Timer.TimerSetup(System.Threading.TimerCallback, System.Object, UInt32, UInt32, Boolean) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Timer.cs @ 905]    
000000D131D7D7E0 00007FF85A8F3253 System.Threading.Timer..ctor(System.Threading.TimerCallback, System.Object, System.TimeSpan, System.TimeSpan) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Timer.cs @ 861]
000000D131D7D840 00007FF8157B5B0A Microsoft.Data.Sqlite.SqliteConnectionFactory..ctor() [/_/src/Microsoft.Data.Sqlite.Core/SqliteConnectionFactory.cs @ 28]
000000D131D7D8F0 00007FF8157B58AC Microsoft.Data.Sqlite.SqliteConnectionFactory..cctor() [/_/src/Microsoft.Data.Sqlite.Core/SqliteConnectionFactory.cs @ 12]
000000D131D7E658 00007ff87532b9e3 [HelperMethodFrame: 000000d131d7e658]
000000D131D7E760 00007FF8157B57C3 Microsoft.Data.Sqlite.SqliteConnection.set_ConnectionString(System.String) [/_/src/Microsoft.Data.Sqlite.Core/SqliteConnection.cs @ 151]
000000D131D7E7F0 00007FF8157B4669 Microsoft.Data.Sqlite.SqliteConnection..ctor(System.String) [/_/src/Microsoft.Data.Sqlite.Core/SqliteConnection.cs @ 120]
000000D131D7E830 00007FF8157B44BF <removed>.Logging.Logger.OpenDatabaseCore(System.String, Boolean)
000000D131D7E8D0 00007FF8157B438B <removed>.Logging.Logger.OpenDatabaseReadWrite(System.String, Boolean ByRef)
000000D131D7E920 00007FF8157B428F <removed>.Logging.Logger.Start(System.String)
000000D131D7E990 00007FF8157B3C6E <removed>.Program+<Main>d__0.MoveNext()
000000D131D7EAA0 00007FF85AC1209A System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[System.__Canon, System.Private.CoreLib]](System.__Canon ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs @ 38]
000000D131D7EB10 00007FF8157B3B35 <removed>.Program.Main(System.String[])
000000D131D7EB80 00007FF8157B3A34 <removed>.Program.<Main>(System.String[])
OS Thread Id: 0x7d70
        Child SP               IP Call Site
000000D13297FA80 00007ff913170af4 [DebuggerU2MCatchHandlerFrame: 000000d13297fa80]
OS Thread Id: 0x6600
        Child SP               IP Call Site
000000D13237DC20 00007ff913173b94 [DynamicHelperFrame: 000000d13237dc20]
000000D13237DD80 00007FF85A8EE7D9 System.Threading.ThreadPoolWorkQueue..ctor() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs @ 426]
000000D13237DE60 00007FF85A8E025E System.Threading.ThreadPool..cctor() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs @ 1354]
000000D13237F000 00007ff87532b9e3 [DynamicHelperFrame: 000000d13237f000]
000000D13237F160 00007FF85A8E005A System.Threading.ThreadPool.get_CompletedWorkItemCount() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.Windows.cs @ 204]
000000D13237F190 00007FF85AA077CA System.Diagnostics.Tracing.RuntimeEventSource+<>c.<OnEventCommand>b__35_9() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/RuntimeEventSource.cs @ 110]
000000D13237F1C0 00007FF85AA0567F System.Diagnostics.Tracing.IncrementingPollingCounter.UpdateMetric() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/IncrementingPollingCounter.cs @ 57]
000000D13237F220 00007FF85A9EDCC8 System.Diagnostics.Tracing.CounterGroup.ResetCounters() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 206]
000000D13237F290 00007FF85A9ED9FE System.Diagnostics.Tracing.CounterGroup.EnableTimer(Single) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 158]
000000D13237F2F0 00007FF85A9ED74F System.Diagnostics.Tracing.CounterGroup.OnEventSourceCommand(System.Object, System.Diagnostics.Tracing.EventCommandEventArgs) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 73]
000000D13237F370 00007FF85A9F9AD9 System.Diagnostics.Tracing.EventSource.DoCommand(System.Diagnostics.Tracing.EventCommandEventArgs) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2656]
000000D13237F3E0 00007FF85A9F98A2 System.Diagnostics.Tracing.EventSource.SendCommand(System.Diagnostics.Tracing.EventListener, System.Diagnostics.Tracing.EventProviderType, Int32, System.Diagnostics.Tracing.EventCommand, 
Boolean, System.Diagnostics.Tracing.EventLevel, System.Diagnostics.Tracing.EventKeywords, System.Collections.Generic.IDictionary`2<System.String,System.String>) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2574]
000000D13237F460 00007FF85A9FED8B System.Diagnostics.Tracing.EventSource+OverrideEventProvider.OnControllerCommand(System.Diagnostics.Tracing.ControllerCommand, System.Collections.Generic.IDictionary`2<System.String,System.String>, Int32) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2438]
000000D13237F4C0 00007FF85A9F01BF System.Diagnostics.Tracing.EventPipeEventProvider.HandleEnableNotification(System.Diagnostics.Tracing.EventProvider, Byte*, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs @ 58]
000000D13237F510 00007FF85A9F51A7 System.Diagnostics.Tracing.EventProviderImpl.ProviderCallback(System.Diagnostics.Tracing.EventProvider, Byte*, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventProvider.cs @ 1310]
000000D13237F570 00007FF85A9F027E System.Diagnostics.Tracing.EventPipeEventProvider.Callback(Byte*, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*, Void*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs @ 68]
> syncblk                                                                                                                                                                                                                   
Index         SyncBlock MonitorHeld Recursion Owning Thread Info          SyncBlock Owner
    1 0000020A0011AEC8            3         1 0000020A3F0F6130 6600   1   0000020a41c0b7f8 System.Collections.Generic.List`1[[System.WeakReference`1[[System.Diagnostics.Tracing.EventSource, System.Private.CoreLib]], System.Private.CoreLib]]
-----------------------------
Total           1
CCW             0
RCW             0
ComClassFactory 0
Free            0

Reproduction Steps

Unsure, but when it happens, restarting the program/debug session will usually fix it.

Expected behavior

The program starts the thread pool work queue just fine, and no deadlock occurs.

Actual behavior

A deadlock occurs because the lock is never released.

Regression?

No response

Known Workarounds

No response

Configuration

  • .NET 8.0.6
  • Windows 11 x86-64
  • JetBrains Rider
  • Microsoft.Data.Sqlite

Other information

No response

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jul 2, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jul 2, 2024
@teo-tsirpanis teo-tsirpanis added area-System.Threading and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jul 2, 2024

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.


colejohnson66 commented Jul 3, 2024

I'm not sure if I'm doing this right, but dumping the memory at the IP of the "dynamic helper frame" shows repeated 32-byte blocks of memory:

> db 7ff913173b94
00007ff913173b94: c3 cd 2e c3 0f 1f 84 00 00 00 00 00 4c 8b d1 b8  ............L...
00007ff913173ba4: e1 01 00 00 f6 04 25 08 03 fe 7f 01 75 03 0f 05  ......%.....u...
00007ff913173bb4: c3 cd 2e c3 0f 1f 84 00 00 00 00 00 4c 8b d1 b8  ............L...
00007ff913173bc4: e2 01 00 00 f6 04 25 08 03 fe 7f 01 75 03 0f 05  ......%.....u...
00007ff913173bd4: c3 cd 2e c3 0f 1f 84 00 00 00 00 00 4c 8b d1 b8  ............L...
00007ff913173be4: e3 01 00 00 f6 04 25 08 03 fe 7f 01 75 03 0f 05  ......%.....u...
00007ff913173bf4: c3 cd 2e c3 0f 1f 84 00 00 00 00 00 4c 8b d1 b8  ............L...
00007ff913173c04: e4 01 00 00 f6 04 25 08 03 fe 7f 01 75 03 0f 05  ......%.....u...

C3 is RET, yet it never returns? It also looks like the IP is inside ntdll.dll:

> lm
... trimmed ...
00007FF9130D0000 00217000 C:\Windows\System32\ntdll.dll
Total image size: 72019968
> sos u 7ff913173b94
Unmanaged code


mangod9 commented Jul 3, 2024

Hey @colejohnson66, are you able to share a repro for this issue?

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Jul 3, 2024
@mangod9 mangod9 added this to the 9.0.0 milestone Jul 3, 2024
@colejohnson66

We unfortunately do not have anything we could share, and I haven’t narrowed anything down. I still have the dump if you’d like anything from it, and I can do some more debugging if necessary.

I wonder if #104457 is related? It being intermittent would be an indication of some race condition.

@KalleOlaviNiemitalo

OS thread 0x6600 has locked EventListener.EventListenersLock, and I think it is waiting for the static constructor of FrameworkEventSource.

OS thread 0xbd4 has entered the static constructor of FrameworkEventSource, and is waiting to lock EventListener.EventListenersLock.

I wonder what caused EventPipeEventProvider.Callback to be called. Had you connected some diagnostic tools to the process before the deadlock occurred?

EventSource.SendCommand has code to enqueue commands if the EventSource hasn't finished initializing yet. Could the deadlock be fixed by making RuntimeEventSource also enqueue commands until FrameworkEventSource finishes initializing?

Alternatively, could the ThreadPool counters implemented in RuntimeEventSource.OnEventCommand return 0 if ThreadPool hasn't been initialized yet, rather than waiting for ThreadPool and FrameworkEventSource to be initialized?
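
To make the shape of this deadlock concrete, here is a minimal sketch (all type and member names below are invented stand-ins, not the actual runtime types): thread A runs a static constructor that takes a shared lock, while thread B holds that lock and then touches the type whose static constructor thread A owns.

using System.Threading;

static class SharedLock
{
    public static readonly object Lock = new();
}

static class LazyEventSourceLikeType
{
    // Mirrors FrameworkEventSource..cctor -> EventSource.Initialize taking EventListenersLock.
    static LazyEventSourceLikeType()
    {
        Thread.Sleep(100);            // widen the race window
        lock (SharedLock.Lock) { }    // blocks while thread B holds the lock
    }

    public static void Touch() { }    // forces the static constructor to run
}

static class DeadlockDemo
{
    static void Main()
    {
        // Thread A: runs LazyEventSourceLikeType's static constructor (owns the class-init lock).
        var a = new Thread(LazyEventSourceLikeType.Touch);

        // Thread B: takes the shared lock, then needs the fully initialized type,
        // like CounterGroup holding EventListenersLock while initialization is still in flight.
        var b = new Thread(() =>
        {
            lock (SharedLock.Lock)
            {
                Thread.Sleep(200);
                LazyEventSourceLikeType.Touch();   // waits on the class-init lock owned by thread A
            }
        });

        a.Start();
        b.Start();
        a.Join();                      // neither Join ever returns once both threads are stuck
        b.Join();
    }
}

With this timing, thread A waits on the monitor owned by thread B while thread B waits on the class-constructor lock owned by thread A, which is the same cycle the syncblk output above shows between threads 0xbd4 and 0x6600.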

@KalleOlaviNiemitalo

I wonder if #104457 is related?

No, that's for TraceSource, and this deadlock is with EventSource.

@colejohnson66

This only started recently, so maybe Rider changed something in its backend in the 2024.2 EAPs? I'm not ruling that possibility out.

My reproduction steps are: click the "Debug" button, observe the debugger start but the GUI never appear, click "Pause" to investigate, and see the deadlock. Every time, it's the same stack trace. After it had happened a few times, I used dotnet-dump instead of pausing and got the same stack trace (the one above).

As for program execution, we have an async Task Main() (which produces an async state machine), and the very first thing it does is start our SQLite-based logger, which calls into Microsoft.Data.Sqlite to open the database. The stack trace on OS thread 0xBD4 above is the very first thing our program executes:

internal static class Program
{
    [STAThread]
    public static async Task Main(string[] args)
    {
        Logger.Instance.Start("Logs.db");
        // ... rest of Main elided
    }
}

I'll see if the callback parameters have any information next time this happens.

colejohnson66 commented Jul 8, 2024

I can't access the stack variables in the debugger:

An IL variable is not available at the current native IP. (0x80131304). The error code is CORDBG_E_IL_VAR_NOT_AVAILABLE, or 0x80131304.

What I can see is that the event counter being incremented is threadpool-completed-items-count. Reading it then runs the static constructor of ThreadPool, which constructs the ThreadPoolWorkQueue.
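
For context, a rough sketch (not the actual runtime source, which lives in System.Private.CoreLib and differs in detail) of how such a counter gets wired up: RuntimeEventSource.OnEventCommand creates an IncrementingPollingCounter whose callback reads ThreadPool.CompletedWorkItemCount, so the first poll of the counter can be the first thing that touches ThreadPool and triggers its static constructor.

using System.Diagnostics.Tracing;
using System.Threading;

[EventSource(Name = "Demo-RuntimeCounters")]
sealed class DemoRuntimeEventSource : EventSource
{
    private IncrementingPollingCounter _completedItemsCounter;

    protected override void OnEventCommand(EventCommandEventArgs command)
    {
        if (command.Command == EventCommand.Enable)
        {
            // The callback reads ThreadPool.CompletedWorkItemCount, which forces
            // ThreadPool's static constructor to run if it hasn't already --
            // the path visible on OS thread 0x6600 in the dump above.
            _completedItemsCounter ??= new IncrementingPollingCounter(
                "threadpool-completed-items-count",
                this,
                () => ThreadPool.CompletedWorkItemCount);
        }
    }
}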

Also, if the Timer used by SqliteConnectionFactory manages to win the race to initialize the thread pool, the program starts fine. In that case it has this stack trace:

new ThreadPoolWorkQueue()
static ThreadPool()
TimerQueue.SetTimer()
TimerQueue.EnsureTimerFiresBy()
TimerQueueTimer.Change()
new TimerQueueTimer()
Timer.TimerSetup()
new Timer()
new SqliteConnectionFactory()
static SqliteConnectionFactory()
SqliteConnection.set_ConnectionString()
new SqliteConnection()
Logger.OpenDatabaseCore()
Logger.OpenDatabaseReadWrite()
Logger.Start()
async Program.Main()
AsyncMethodBuilderCore.Start<{removed}.Program.<Main>d__0>()
Program.Main()
Program.<Main>()

@KalleOlaviNiemitalo

This may be a duplicate of #93175.


Tagging subscribers to this area: @tarekgh, @tommcdon, @pjanotti
See info in area-owners.md if you want to be subscribed.

@tommcdon tommcdon added this to the 10.0.0 milestone Aug 29, 2024
@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Aug 29, 2024

mdh1418 commented Sep 23, 2024

I believe this is fixed by #105548. I was able to repro the issue on .NET SDK 8.0.300, and with a local build of the runtime that includes that PR I could not trigger the deadlock. This issue seems to have the same underlying cause as #105682, which I based the repro on.
