
src: fix platform shutdown deadlock #56827

Open · wants to merge 1 commit into main

Conversation

@Qard (Member) commented Jan 30, 2025

Fixes #54918

Each worker signals its own completion of tasks independently, and so should only signal for one corresponding drain; otherwise the count of outstanding tasks goes out of sync and the process never stops waiting for tasks when it should be exiting.

It just needs to call Signal rather than Broadcast.

Not sure if there was a reason for it to be a broadcast in the first place, but if so then the outstanding_tasks_ count adjustment needs to factor that in properly.
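
For reference, here is a minimal standalone sketch of the drain-counter pattern the commit message describes, using std::mutex/std::condition_variable rather than Node's node_mutex.h wrappers; the class and method names are illustrative, not the actual node_platform.cc code. The premise of the change is that one task completion should correspond to one wakeup of the draining thread (Signal / notify_one) rather than a broadcast to every waiter.

#include <condition_variable>
#include <mutex>

// Illustrative stand-in for the outstanding-task bookkeeping; not Node's code.
class OutstandingTasks {
 public:
  // Called when a task is handed to a worker.
  void TaskStarted() {
    std::lock_guard<std::mutex> lock(mutex_);
    ++outstanding_;
  }

  // Called by a worker when it finishes one task. One completion maps to at
  // most one wakeup of the draining thread (notify_one, i.e. Signal).
  void TaskCompleted() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (--outstanding_ == 0)
      drained_.notify_one();
  }

  // Called during shutdown: block until all queued work has finished.
  void BlockingDrain() {
    std::unique_lock<std::mutex> lock(mutex_);
    drained_.wait(lock, [this] { return outstanding_ == 0; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable drained_;
  int outstanding_ = 0;
};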

@Qard added labels: confirmed-bug, c++, lib / src, v8 platform (Jan 30, 2025)
@mcollina (Member) left a comment:

lgtm


codecov bot commented Jan 30, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.20%. Comparing base (304bb9c) to head (d674a68).

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #56827   +/-   ##
=======================================
  Coverage   89.20%   89.20%           
=======================================
  Files         663      663           
  Lines      192012   192012           
  Branches    36929    36933    +4     
=======================================
+ Hits       171286   171293    +7     
- Misses      13582    13590    +8     
+ Partials     7144     7129   -15     
Files with missing lines | Coverage Δ
src/node_platform.cc | 87.83% <100.00%> (ø)

... and 23 files with indirect coverage changes

@targos (Member) commented Jan 30, 2025

This seems to bring new flakiness to the --loader tests, which involve a worker.

@Qard (Member, Author) commented Jan 30, 2025

I think it may also not have actually fixed the issue, just made it rarer. I've been doing a very long run locally. Before this change I encountered the issue at 9.9k iterations; with it, the issue eventually popped up at 117k iterations.

@lpinca (Member) commented Jan 30, 2025

Yes, unfortunately it does not seem to fix the issue:

$ python3 tools/test.py --repeat=1000 test/parallel/test-net-write-fully-async-hex-string.js
[00:35|%  98|+ 984|-   0]: release test-net-write-fully-async-hex-string
$ lldb -p 8578
(lldb) process attach --pid 8578
Process 8578 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007ff8111b1aaa libsystem_kernel.dylib`__psynch_cvwait + 10
libsystem_kernel.dylib`__psynch_cvwait:
->  0x7ff8111b1aaa <+10>: jae    0x7ff8111b1ab4 ; <+20>
    0x7ff8111b1aac <+12>: movq   %rax, %rdi
    0x7ff8111b1aaf <+15>: jmp    0x7ff8111af662 ; cerror_nocancel
    0x7ff8111b1ab4 <+20>: retq   
Target 0: (node) stopped.
Executable module set to "/Users/luigi/code/node/out/Release/node".
Architecture set to: x86_64-apple-macosx-.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007ff8111b1aaa libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff8111f07a8 libsystem_pthread.dylib`_pthread_cond_wait + 1193
    frame #2: 0x0000000104ffb893 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x00000001041148bb node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::LibuvMutexTraits::cond_wait(cond=0x00007fdaa3204e78, mutex=0x00007fdaa3204e08) at node_mutex.h:175:5 [opt]
    frame #4: 0x00000001041148a1 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::ConditionVariableBase<node::LibuvMutexTraits>::Wait(this=0x00007fdaa3204e78, scoped_lock=<unavailable>) at node_mutex.h:249:3 [opt]
    frame #5: 0x00000001041148a1 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::TaskQueue<v8::Task>::BlockingDrain(this=0x00007fdaa3204e08) at node_platform.cc:640:20 [opt]
    frame #6: 0x000000010411488b node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::WorkerThreadsTaskRunner::BlockingDrain(this=0x00007fdaa3204e08) at node_platform.cc:214:25 [opt]
    frame #7: 0x000000010411488b node`node::NodePlatform::DrainTasks(this=0x0000600002740000, isolate=<unavailable>) at node_platform.cc:467:33 [opt]
    frame #8: 0x0000000103f7ca60 node`node::SpinEventLoopInternal(env=0x00007fdaa5049c00) at embed_helpers.cc:44:17 [opt]
    frame #9: 0x00000001040d9b33 node`node::NodeMainInstance::Run() [inlined] node::NodeMainInstance::Run(this=<unavailable>, exit_code=<unavailable>, env=0x00007fdaa5049c00) at node_main_instance.cc:111:9 [opt]
    frame #10: 0x00000001040d9a99 node`node::NodeMainInstance::Run(this=<unavailable>) at node_main_instance.cc:100:3 [opt]
    frame #11: 0x00000001040414e6 node`node::Start(int, char**) [inlined] node::StartInternal(argc=<unavailable>, argv=<unavailable>) at node.cc:1488:24 [opt]
    frame #12: 0x0000000104041342 node`node::Start(argc=<unavailable>, argv=<unavailable>) at node.cc:1495:27 [opt]
    frame #13: 0x00007ff810e612cd dyld`start + 1805
(lldb)

@Qard (Member, Author) commented Jan 31, 2025

A bt all seems to point to a deadlock in the communication through WorkerThreadsTaskRunner::DelayedTaskScheduler to the TaskQueue worker threads. I see NodePlatform stuck on DrainTasks() while all the TaskQueue workers are stuck on BlockingPop(). Probably something to do with the timing around the ConditionVariable synchronization and how that interacts with the stopped_ state of the TaskQueue. 🤔

(lldb) bt all
  thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001013e6a60 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x000000010063d664 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::LibuvMutexTraits::cond_wait(cond=0x0000000143706ff8, mutex=0x0000000143706f88) at node_mutex.h:175:5 [opt]
    frame #4: 0x000000010063d658 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::ConditionVariableBase<node::LibuvMutexTraits>::Wait(this=0x0000000143706ff8, scoped_lock=<unavailable>) at node_mutex.h:249:3 [opt]
    frame #5: 0x000000010063d658 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::TaskQueue<v8::Task>::BlockingDrain(this=0x0000000143706f88) at node_platform.cc:639:20 [opt]
    frame #6: 0x000000010063d640 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::WorkerThreadsTaskRunner::BlockingDrain(this=0x0000000143706f88) at node_platform.cc:211:25 [opt]
    frame #7: 0x000000010063d640 node`node::NodePlatform::DrainTasks(this=0x0000600002f0c0a0, isolate=<unavailable>) at node_platform.cc:466:33 [opt]
    frame #8: 0x00000001004cca50 node`node::SpinEventLoopInternal(env=0x0000000133810200) at embed_helpers.cc:44:17 [opt]
    frame #9: 0x000000010060a0a8 node`node::NodeMainInstance::Run() [inlined] node::NodeMainInstance::Run(this=<unavailable>, exit_code=<unavailable>, env=0x0000000133810200) at node_main_instance.cc:111:9 [opt]
    frame #10: 0x000000010060a014 node`node::NodeMainInstance::Run(this=<unavailable>) at node_main_instance.cc:100:3 [opt]
    frame #11: 0x000000010057fc98 node`node::Start(int, char**) [inlined] node::StartInternal(argc=<unavailable>, argv=<unavailable>) at node.cc:1488:24 [opt]
    frame #12: 0x000000010057fc0c node`node::Start(argc=<unavailable>, argv=<unavailable>) at node.cc:1495:27 [opt]
    frame #13: 0x0000000199c94274 dyld`start + 2840
* thread #2
    frame #0: 0x0000000199fdbefc libsystem_kernel.dylib`kevent + 8
  * frame #1: 0x00000001013eb980 node`uv__io_poll(loop=0x000000014401dce8, timeout=-1) at kqueue.c:286:12 [opt]
    frame #2: 0x00000001013d83ec node`uv_run(loop=0x000000014401dce8, mode=UV_RUN_DEFAULT) at core.c:460:5 [opt]
    frame #3: 0x000000010063ddf4 node`node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Run(this=0x000000014401dc00) at node_platform.cc:95:5 [opt]
    frame #4: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #3
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001013e6a60 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x000000010063dfdc node`node::TaskQueue<v8::Task>::BlockingPop() [inlined] node::LibuvMutexTraits::cond_wait(cond=0x0000000143706fc8, mutex=0x0000000143706f88) at node_mutex.h:175:5 [opt]
    frame #4: 0x000000010063dfd0 node`node::TaskQueue<v8::Task>::BlockingPop() [inlined] node::ConditionVariableBase<node::LibuvMutexTraits>::Wait(this=0x0000000143706fc8, scoped_lock=<unavailable>) at node_mutex.h:249:3 [opt]
    frame #5: 0x000000010063dfd0 node`node::TaskQueue<v8::Task>::BlockingPop(this=0x0000000143706f88) at node_platform.cc:617:22 [opt]
    frame #6: 0x000000010063ba90 node`node::(anonymous namespace)::PlatformWorkerThread(data=0x0000600001007c90) at node_platform.cc:42:61 [opt]
    frame #7: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #4
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001013e6a60 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x000000010063dfdc node`node::TaskQueue<v8::Task>::BlockingPop() [inlined] node::LibuvMutexTraits::cond_wait(cond=0x0000000143706fc8, mutex=0x0000000143706f88) at node_mutex.h:175:5 [opt]
    frame #4: 0x000000010063dfd0 node`node::TaskQueue<v8::Task>::BlockingPop() [inlined] node::ConditionVariableBase<node::LibuvMutexTraits>::Wait(this=0x0000000143706fc8, scoped_lock=<unavailable>) at node_mutex.h:249:3 [opt]
    frame #5: 0x000000010063dfd0 node`node::TaskQueue<v8::Task>::BlockingPop(this=0x0000000143706f88) at node_platform.cc:617:22 [opt]
    frame #6: 0x000000010063ba90 node`node::(anonymous namespace)::PlatformWorkerThread(data=0x0000600001007cc0) at node_platform.cc:42:61 [opt]
    frame #7: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #5
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001013e6a60 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x000000010063dfdc node`node::TaskQueue<v8::Task>::BlockingPop() [inlined] node::LibuvMutexTraits::cond_wait(cond=0x0000000143706fc8, mutex=0x0000000143706f88) at node_mutex.h:175:5 [opt]
    frame #4: 0x000000010063dfd0 node`node::TaskQueue<v8::Task>::BlockingPop() [inlined] node::ConditionVariableBase<node::LibuvMutexTraits>::Wait(this=0x0000000143706fc8, scoped_lock=<unavailable>) at node_mutex.h:249:3 [opt]
    frame #5: 0x000000010063dfd0 node`node::TaskQueue<v8::Task>::BlockingPop(this=0x0000000143706f88) at node_platform.cc:617:22 [opt]
    frame #6: 0x000000010063ba90 node`node::(anonymous namespace)::PlatformWorkerThread(data=0x0000600001007cf0) at node_platform.cc:42:61 [opt]
    frame #7: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #6
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x000000010099e35c node`void heap::base::Stack::SetMarkerForBackgroundThreadAndCallbackImpl<void v8::internal::LocalHeap::ExecuteWhileParked<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0>(v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0)::'lambda'()>(heap::base::Stack*, void*, void const*) [inlined] v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0::operator()(this=<unavailable>) const at collection-barrier.cc:125:18 [opt]
    frame #3: 0x000000010099e330 node`void heap::base::Stack::SetMarkerForBackgroundThreadAndCallbackImpl<void v8::internal::LocalHeap::ExecuteWhileParked<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0>(v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0)::'lambda'()>(heap::base::Stack*, void*, void const*) [inlined] void v8::internal::LocalHeap::ParkAndExecuteCallback<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0>(this=0x000000017216e3a8, callback=(unnamed class) @ 0x0000600001428df0) at local-heap-inl.h:66:5 [opt]
    frame #4: 0x000000010099e308 node`void heap::base::Stack::SetMarkerForBackgroundThreadAndCallbackImpl<void v8::internal::LocalHeap::ExecuteWhileParked<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0>(v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0)::'lambda'()>(heap::base::Stack*, void*, void const*) [inlined] void v8::internal::LocalHeap::ExecuteWhileParked<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0>(this=<unavailable>)::$_0)::'lambda'()::operator()() const at local-heap-inl.h:83:28 [opt]
    frame #5: 0x000000010099e2fc node`void heap::base::Stack::SetMarkerForBackgroundThreadAndCallbackImpl<void v8::internal::LocalHeap::ExecuteWhileParked<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0>(v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0)::'lambda'()>(stack=0x0000000145012420, argument=<unavailable>, stack_end=<unavailable>) at stack.h:204:5 [opt]
    frame #6: 0x00000001013a9eb8 node`PushAllRegistersAndIterateStack + 40
    frame #7: 0x000000010099de20 node`v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*) [inlined] void v8::internal::LocalHeap::ExecuteWithStackMarker<void v8::internal::LocalHeap::ExecuteWhileParked<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0>(v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0)::'lambda'()>(this=0x000000017216e3a8, callback=<unavailable>) at local-heap-inl.h:0 [opt]
    frame #8: 0x000000010099dddc node`v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*) [inlined] void v8::internal::LocalHeap::ExecuteWhileParked<v8::internal::CollectionBarrier::AwaitCollectionBackground(v8::internal::LocalHeap*)::$_0>(this=0x000000017216e3a8, callback=(unnamed class) @ 0x0000600001428e20) at local-heap-inl.h:82:3 [opt]
    frame #9: 0x000000010099dddc node`v8::internal::CollectionBarrier::AwaitCollectionBackground(this=0x0000600002f00140, local_heap=0x000000017216e3a8) at collection-barrier.cc:117:15 [opt]
    frame #10: 0x00000001009f2f70 node`v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [inlined] v8::internal::HeapAllocator::CollectGarbage(this=0x000000017216e3f8, allocation=kOld) at heap-allocator.cc:140:12 [opt]
    frame #11: 0x00000001009f2f58 node`v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(this=0x000000017216e3f8, size=16, allocation=kOld, origin=kRuntime, alignment=kTaggedAligned) at heap-allocator.cc:120:5 [opt]
    frame #12: 0x00000001009f3ac4 node`v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(this=0x000000017216e3f8, size=16, allocation=kOld, origin=kRuntime, alignment=kTaggedAligned) at heap-allocator.cc:148:7 [opt]
    frame #13: 0x00000001009c8170 node`v8::internal::FactoryBase<v8::internal::LocalFactory>::NewCode(v8::internal::NewCodeOptions const&) [inlined] v8::internal::FactoryBase<v8::internal::LocalFactory>::AllocateRaw(this=0x000000017216e3a0, size=<unavailable>, allocation=kOld, alignment=kTaggedAligned) at factory-base.cc:1282:18 [opt]
    frame #14: 0x00000001009c8164 node`v8::internal::FactoryBase<v8::internal::LocalFactory>::NewCode(v8::internal::NewCodeOptions const&) [inlined] v8::internal::FactoryBase<v8::internal::LocalFactory>::AllocateRawWithImmortalMap(this=0x000000017216e3a0, size=<unavailable>, allocation=kOld, map=Tagged<v8::internal::Map> @ x21, alignment=kTaggedAligned) at factory-base.cc:1273:31 [opt]
    frame #15: 0x00000001009c8164 node`v8::internal::FactoryBase<v8::internal::LocalFactory>::NewCode(v8::internal::NewCodeOptions const&) [inlined] v8::internal::FactoryBase<v8::internal::LocalFactory>::NewWithImmortalMap(this=0x000000017216e3a0, map=Tagged<v8::internal::Map> @ x21, allocation=kOld) at factory-base.cc:1262:10 [opt]
    frame #16: 0x00000001009c815c node`v8::internal::FactoryBase<v8::internal::LocalFactory>::NewCode(v8::internal::NewCodeOptions const&) [inlined] v8::internal::FactoryBase<v8::internal::LocalFactory>::NewCodeWrapper(this=0x000000017216e3a0) at factory-base.cc:160:25 [opt]
    frame #17: 0x00000001009c8154 node`v8::internal::FactoryBase<v8::internal::LocalFactory>::NewCode(this=0x000000017216e3a0, options=0x000000017216d918) at factory-base.cc:82:39 [opt]
    frame #18: 0x00000001009d4b30 node`v8::internal::TorqueGeneratedFactory<v8::internal::Factory>::NewUncompiledDataWithPreparseData(v8::internal::Handle<v8::internal::String>, int, int, v8::internal::Handle<v8::internal::PreparseData>, v8::internal::AllocationType) [inlined] v8::internal::CombinedWriteBarrier(host=<unavailable>, slot=<unavailable>, value=<unavailable>, mode=<unavailable>) - 18446744069404275919 [opt]
    frame #19: 0x0000000100f23a5c node`v8::internal::maglev::MaglevCodeGenerator::BuildCodeObject(this=<unavailable>, local_isolate=<unavailable>) at maglev-code-generator.cc:1919:8 [opt]
    frame #20: 0x0000000100f21f84 node`v8::internal::maglev::MaglevCodeGenerator::Assemble(this=0x0000000143604700) at maglev-code-generator.cc:1710:9 [opt]
    frame #21: 0x0000000100f4d1c8 node`v8::internal::maglev::MaglevCompiler::Compile(local_isolate=0x000000017216e3a0, compilation_info=0x0000600002c005a0) at maglev-compiler.cc:208:36 [opt]
    frame #22: 0x0000000100fb1774 node`v8::internal::maglev::MaglevCompilationJob::ExecuteJobImpl(this=0x0000600002c00630, stats=<unavailable>, local_isolate=0x000000017216e3a0) at maglev-concurrent-dispatcher.cc:137:8 [opt]
    frame #23: 0x0000000100878c1c node`v8::internal::OptimizedCompilationJob::ExecuteJob(this=0x0000600002c00630, stats=<unavailable>, local_isolate=<unavailable>) at compiler.cc:485:22 [opt]
    frame #24: 0x0000000100fb2d14 node`v8::internal::maglev::MaglevConcurrentDispatcher::JobTask::Run(this=0x0000600001e2c260, delegate=0x000000017216ef08) at maglev-concurrent-dispatcher.cc:283:18 [opt]
    frame #25: 0x000000010181dcb4 node`v8::platform::DefaultJobWorker::Run(this=0x0000600001e2e920) at default-job.h:147:18 [opt]
    frame #26: 0x000000010063baa4 node`node::(anonymous namespace)::PlatformWorkerThread(data=0x0000600001007d20) at node_platform.cc:43:11 [opt]
    frame #27: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #7
    frame #0: 0x0000000199fd5db0 libsystem_kernel.dylib`semaphore_wait_trap + 8
    frame #1: 0x00000001013e68dc node`uv_sem_wait(sem=0x000000010561fc7c) at thread.c:492:9 [opt]
    frame #2: 0x00000001006f2930 node`node::inspector::(anonymous namespace)::StartIoThreadMain(unused=<unavailable>) at inspector_agent.cc:90:5 [opt]
    frame #3: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #8
    frame #0: 0x0000000199fd7a84 libsystem_kernel.dylib`__workq_kernreturn + 8
  thread #9, name = 'libuv-worker'
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001013e6a60 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x00000001013d45f0 node`worker(arg=0x0000000000000000) at threadpool.c:77:7 [opt]
    frame #4: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #10, name = 'libuv-worker'
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001013e6a60 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x00000001013d45f0 node`worker(arg=0x0000000000000000) at threadpool.c:77:7 [opt]
    frame #4: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #11, name = 'libuv-worker'
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001013e6a60 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x00000001013d45f0 node`worker(arg=0x0000000000000000) at threadpool.c:77:7 [opt]
    frame #4: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136
  thread #12, name = 'libuv-worker'
    frame #0: 0x0000000199fd95cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000019a017894 libsystem_pthread.dylib`_pthread_cond_wait + 1204
    frame #2: 0x00000001013e6a60 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x00000001013d45f0 node`worker(arg=0x0000000000000000) at threadpool.c:77:7 [opt]
    frame #4: 0x000000019a0172e4 libsystem_pthread.dylib`_pthread_start + 136

@Qard (Member, Author) commented Jan 31, 2025

I suspect it's a spurious wakeup that's not handled safely and so the outstanding_tasks_ count goes out of sync.
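
A minimal illustration of that concern (not the actual node_platform.cc code): if a wait loop treats every wakeup as if a task completed, a spurious wakeup corrupts the count, whereas a predicate-checked wait only ever re-reads the shared counter under the lock.

#include <condition_variable>
#include <mutex>

std::mutex mutex_;
std::condition_variable cv_;
int outstanding_tasks_ = 0;  // decremented by the task-completion path

// Deliberately broken pattern: assumes one wakeup == one completed task, so a
// spurious wakeup (or an extra broadcast) drives the count below its true value.
void DrainAssumingEveryWakeupIsACompletion() {
  std::unique_lock<std::mutex> lock(mutex_);
  while (outstanding_tasks_ > 0) {
    cv_.wait(lock);
    --outstanding_tasks_;  // wrong if the wakeup was spurious
  }
}

// Robust pattern: the drain loop never mutates the count; it only re-checks it
// under the lock after every wakeup, so spurious wakeups are harmless.
void DrainRecheckingPredicate() {
  std::unique_lock<std::mutex> lock(mutex_);
  cv_.wait(lock, [] { return outstanding_tasks_ == 0; });
}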

@santigimeno (Member) commented:

Looking at the traces, it seems to me that the problem is that one of the tasks in the Platform workers never ends, so DrainTasks never ends. In the traces from #56827 (comment), Thread #6 is stuck in the cond_wait in heap::base::Stack::SetMarkerForBackgroundThreadAndCallbackImpl(); maybe a deadlock in v8?

@lpinca (Member) commented Jan 31, 2025

According to the V8 team, the issue is in Node.js; see https://issues.chromium.org/issues/374285493.

Labels
c++ · confirmed-bug · lib / src · needs-ci · v8 platform
Development

Successfully merging this pull request may close these issues.

Deadlock at process shutdown
6 participants