Windows CI failures: parallel/test-trace-events-fs-sync #25512
Comments
Is 4da7e6e a candidate? That would be 6 hours before 7f91329, but as I said in the PR, I was a bit scared of how wide-reaching that change may be…
@addaleax - it is not. I am able to recreate the issue with and without the said commit. My challenge is that I am unable to get a user dump on my system (for some unknown reason, it is not producing one).
Happy to report that it's now only happening frequently in CI, and not non-stop. 😅
So …
@refack - do you have guidance on how to enable …?
semi-bisect: testing commit 92e95f1 which is the parent of 4da7e6e
@gireeshpunathil best experience I had was with enabling silent-exit-dumps #13947 (comment)
A few data points:
Some data points from nodejs/reliability#20
Example
At this point the effort is to get a dump; the recreate seems to be easy. This fails only on Windows, and we don't have a single dump to see what the state of the child process was when it caused the access violation.
I made a minimal piece of code that potentially recreates the same thing, and I got this:
However, the user dump is not there to figure out the context. Either … So, we are on the exit path for sure, but then a bad handle came our way. Could it be that it is already destroyed?
ok, modified the code to honor abort on exceptions, and got this info from the VS debugger (data copied from its UI frames, so not very readable).
message:
call stack:
node.exe!abort() Line 77 C++
node.exe!common_assert_to_stderr<wchar_t>(const wchar_t * const expression, const wchar_t * const file_name, const unsigned int line_number) Line 186 C++
node.exe!_wassert(const wchar_t * expression, const wchar_t * file_name, unsigned int line_number) Line 443 C++
> node.exe!uv_run(uv_loop_s * loop, <unnamed-enum-UV_RUN_DEFAULT> mode) Line 534 C
node.exe!uv__thread_start(void * arg) Line 104 C
node.exe!thread_start<unsigned int (__cdecl*)(void * __ptr64)>(void * const parameter) Line 115 C++
verifier.dll!00007ff93419507d() Unknown
kernel32.dll!00007ff94de88102() Unknown
ntdll.dll!00007ff94f56c5b4() Unknown
loop content around endgame_handles:
so it is the same as …
looking through the threads, 2 of them are relevant while the others are inert, just blocked on conditions.
failing thread (0x3d7c, some helper thread?):
node!abort+0x35
node!common_assert_to_stderr<wchar_t>+0xc8
node!_wassert+0x72
node!uv_run+0x3da
node!uv__thread_start+0x4d
node!thread_start<unsigned int (__cdecl*)(void * __ptr64)>+0x50
verifier!VerifierGetPropertyValueByName+0x1aefd
kernel32!BaseThreadInitThunk+0x22
ntdll!RtlUserThreadStart+0x34
main thread (0x280c):
ntdll!ZwWaitForAlertByThreadId+0x14
ntdll!RtlSleepConditionVariableCS+0xd0
KERNELBASE!SleepConditionVariableCS+0x37
node!uv_cond_wait+0x10
node!node::tracing::NodeTraceBuffer::`scalar deleting destructor'+0x63
node!v8::platform::tracing::TracingController::Initialize+0x24
node!node::tracing::Agent::~Agent+0x74
node!node::Start+0x56a
node!wmain+0x1bc
node!__scrt_common_main_seh+0x10c
kernel32!BaseThreadInitThunk+0x22
ntdll!RtlUserThreadStart+0x34
will see if f39b3e3 is of any relevance.
fwiw, the failing thread is identified as the Agent thread that runs the event loop on …
that leads to a point (of dilemma for me): is it fair / safe to execute …? edit: …
The code looks ok. We wait for the thread to join in … I can't get this to reproduce on my box. Could you inspect the content of the handle that causes the crash?
@bzoz - sure; been trying hard to get a repro. Will update when I have one.
@gireeshpunathil I’m pretty sure the answer to that is “no”.
Other runtimes do this (e.g. Electron & Julia), but in a thread-safe way (conditionally on a mutex or other locks).
There is a `uv_thread_join` on the thread that runs the loop before the other thread gives the loop a spin.
Oh, yeah, running it concurrently is the issue. I think consecutively running on different threads should not be an issue on its own.
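(To make that distinction concrete, here is a minimal sketch - assuming nothing about Node's actual code - of the consecutive pattern: one thread runs the loop, is joined, and only afterwards another thread gives the same loop a spin.)

#include <uv.h>

// Sketch: consecutive (not concurrent) use of one loop from two threads.
static void agent_thread(void* arg) {
  uv_loop_t* loop = static_cast<uv_loop_t*>(arg);
  uv_run(loop, UV_RUN_DEFAULT);   // sole user of the loop while it runs
}

int main() {
  uv_loop_t loop;
  uv_loop_init(&loop);

  uv_thread_t tid;
  uv_thread_create(&tid, agent_thread, &loop);
  uv_thread_join(&tid);           // after this, the agent thread is done with the loop

  uv_run(&loop, UV_RUN_DEFAULT);  // safe: no concurrent access to the loop anymore
  uv_loop_close(&loop);
  return 0;
}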
One challenge in getting the needed info: right now the failing code is run as a child process; if we move it to the parent then it does not fail anymore, and if run as-is it fails, but does not produce dumps. I am running with … Is there a specific flag I could pass to …?
ok, now given the problem site is localized, I covered the faulty lines in … The only possibility I can think of is concurrent access to the … If 2 threads need to operate on the same loop object, I don't know how many places will need exclusive access, but in this case, at least in …
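(A minimal sketch - not libuv code - of the kind of unsynchronized access being suspected here: an intrusive singly-linked list, like the loop's endgame queue, pushed and popped from two threads with no lock.)

// Two threads touching the same intrusive list head without a mutex:
// one thread can unlink and free a node while the other is still reading it.
struct fake_handle {
  fake_handle* endgame_next;
};

struct fake_loop {
  fake_handle* endgame_handles = nullptr;
};

// Thread A queues a handle for its endgame.
void want_endgame_unlocked(fake_loop* loop, fake_handle* handle) {
  handle->endgame_next = loop->endgame_handles;  // read head
  loop->endgame_handles = handle;                // write head - races with the pop below
}

// Thread B (the loop runner) drains the queue.
void process_endgames_unlocked(fake_loop* loop) {
  while (loop->endgame_handles) {
    fake_handle* handle = loop->endgame_handles;  // may be a pointer the other
    loop->endgame_handles = handle->endgame_next; // thread already freed: access violation
  }
}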
Given that we precisely know the failing line, I patched libuv thus:
diff --git a/deps/uv/src/win/handle-inl.h b/deps/uv/src/win/handle-inl.h
index 82c657d579..88bc38470b 100644
--- a/deps/uv/src/win/handle-inl.h
+++ b/deps/uv/src/win/handle-inl.h
@@ -91,16 +91,32 @@ INLINE static void uv_want_endgame(uv_loop_t* loop, uv_handle_t* handle) {
handle->endgame_next = loop->endgame_handles;
loop->endgame_handles = handle;
+ fprintf(stderr, "uv_want_endgame, loop: 0x%p, handle: 0x%p, type: %d, requester: %d\n", loop, handle, handle->type, GetCurrentThreadId());
}
}
+inline DWORD GGG(LPEXCEPTION_POINTERS ptr)
+{
+ EXCEPTION_RECORD* record = ptr->ExceptionRecord;
+ fprintf(stderr, "code: %d, flags: %d\n", record->ExceptionCode, record->ExceptionFlags);
+ return EXCEPTION_EXECUTE_HANDLER;
+}
+
INLINE static void uv_process_endgames(uv_loop_t* loop) {
uv_handle_t* handle;
+
+
while (loop->endgame_handles) {
- handle = loop->endgame_handles;
- loop->endgame_handles = handle->endgame_next;
+ __try {
+ handle = loop->endgame_handles;
+ fprintf(stderr, "uv_process_endgame, loop: 0x%p, handle: 0x%p, type: %d, executor: %d\n", loop, handle, handle->type, GetCurrentThreadId());
+ loop->endgame_handles = handle->endgame_next;
+ }
+ __except(GGG(GetExceptionInformation())) {
+ fprintf(stderr, "surprise in the end game!\n");
+ }
handle->flags &= ~UV_HANDLE_ENDGAME_QUEUED;
and got this:
uv_want_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBED20, type: 13, requester: 117392
uv_want_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBEDC0, type: 2, requester: 117392
uv_want_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBEE38, type: 6, requester: 117392
uv_want_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBEEB0, type: 9, requester: 117392
uv_want_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBEF28, type: 2, requester: 117392
uv_process_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBEF28, type: 2, executor: 117392
uv_process_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBEEB0, type: 9, executor: 117392
uv_process_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBEE38, type: 6, executor: 117392
uv_process_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBEDC0, type: 2, executor: 117392
uv_process_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBED20, type: 13, executor: 117392
// no problem, one full cycle of end game
// from the many types of handles, this looks like the main loop
uv_want_endgame, loop: 0x00007FF620A97810, handle: 0x00007FF620A951D0, type: 1, requester: 117392
uv_want_endgame, loop: 0x00007FF620A97810, handle: 0x0000025AD5BEB170, type: 1, requester: 117392
// 2 new parties registered for end game, but did not fire yet.
uv_want_endgame, loop: 0x0000025AD5BC0728, handle: 0x0000025AD5BC1FB0, type: 1, requester: 62228
uv_want_endgame, loop: 0x0000025AD5BC0728, handle: 0x0000025AD5BC2090, type: 1, requester: 62228
uv_process_endgame, loop: 0x0000025AD5BC0728, handle: 0x0000025AD5BC2090, type: 1, executor: 62228
uv_process_endgame, loop: 0x0000025AD5BC0728, handle: 0x0000025AD5BC1FB0, type: 1, executor: 62228
// failing thread (agent's thread) runs one cycle of its end game. All are async handles.
uv_want_endgame, loop: 0x0000025AD5BC9238, handle: 0x0000025AD5BC9408, type: 1, requester: 83608
uv_process_endgame, loop: 0x0000025AD5BC9238, handle: 0x0000025AD5BC9408, type: 1, executor: 83608
// some other loop, some other thread. don't know what it is.
uv_want_endgame, loop: 0x0000025AD5BC0728, handle: 0x0000025AD5BC2580, type: 1, requester: 62228
uv_want_endgame, loop: 0x0000025AD5BC0728, handle: 0x0000025AD5BC2660, type: 1, requester: 62228
uv_process_endgame, loop: 0x0000025AD5BC0728, handle: 0x0000025AD5BC2660, type: 1, executor: 62228
uv_process_endgame, loop: 0x0000025AD5BC0728, handle: 0x0000025AD5BC2580, type: 1, executor: 62228
// failing thread comes again, this time only to crash.
code: -1073741819, flags: 0
surprise in the end game!
Observations:
Inferences:
@nodejs/libuv
ok, so here we are! In addition to the above, add this one too:
diff --git a/src/tracing/node_trace_buffer.cc b/src/tracing/node_trace_buffer.cc
index 3b7119f6d5..9b907a6a81 100644
--- a/src/tracing/node_trace_buffer.cc
+++ b/src/tracing/node_trace_buffer.cc
@@ -110,6 +110,7 @@ NodeTraceBuffer::NodeTraceBuffer(size_t max_chunks,
}
NodeTraceBuffer::~NodeTraceBuffer() {
+ fprintf(stderr, "destroying exit signal handle 0x%p\n", &exit_signal_);
uv_async_send(&exit_signal_);
Mutex::ScopedLock scoped_lock(exit_mutex_);
while (!exited_) {
and we get here:
uv_want_endgame, loop: 0x000002B61728B058, handle: 0x000002B61728B228, type: 1, requester: 11804
uv_process_endgame, loop: 0x000002B61728B058, handle: 0x000002B61728B228, type: 1, executor: 11804
...
destroying exit signal handle 0x000002B617283090
...
uv_want_endgame, loop: 0x000002B61727F618, handle: 0x000002B617282FB0, type: 1, requester: 103264
uv_want_endgame, loop: 0x000002B61727F618, handle: 0x000002B617283090, type: 1, requester: 103264
uv_process_endgame, loop: 0x000002B61727F618, handle: 0x000002B617283090, type: 1, executor: 103264
uv_process_endgame, loop: 0x000002B61727F618, handle: 0x000002B617282FB0, type: 1, executor: 103264
surprise in the end game!
The issue exists on all platforms; the crash depends on how the C++ runtime deals with deleted memory. Some keep it aside for future use, some unmap it, in which case it causes a definite crash. From the symptom I assume Win 10 frees it up. As the handle is used for cross-thread communication, it should outlive the buffer object. cc @addaleax
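(To illustrate the lifetime rule being stated, a deliberately broken sketch - not Node's code: an async handle embedded in an object is closed, and the object is freed before libuv has run the close callback.)

#include <uv.h>

// Deliberately broken: the handle's storage dies before libuv is done with it.
struct BufferLike {
  uv_async_t exit_signal_;          // used for cross-thread signalling
};

static void on_exit_signal(uv_async_t*) {}

void broken_teardown(uv_loop_t* loop) {
  BufferLike* buf = new BufferLike();
  uv_async_init(loop, &buf->exit_signal_, on_exit_signal);

  uv_close(reinterpret_cast<uv_handle_t*>(&buf->exit_signal_), nullptr);
  delete buf;   // BUG: the close callback has not run yet; when the loop later
                // processes its endgame queue it dereferences freed memory.
}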
@gireeshpunathil It looks like you’ve struck gold, but I’m not quite sure that what you’re suspecting as the reason is correct:
However, there’s something that sticks out about your last log output: we tell the thread that deletes … To confirm, running with this patch doesn’t make it crash on Linux for me, but it does make valgrind complain:
@@ -176,9 +177,12 @@ void NodeTraceBuffer::ExitSignalCb(uv_async_t* signal) {
[](uv_handle_t* signal) {
NodeTraceBuffer* buffer =
reinterpret_cast<NodeTraceBuffer*>(signal->data);
+ {
Mutex::ScopedLock scoped_lock(buffer->exit_mutex_);
buffer->exited_ = true;
buffer->exit_cond_.Signal(scoped_lock);
+ }
+ sleep(1);
});
}
Libuv does not guarantee that handles have their close callbacks called in the order in which they were added (and in fact, currently calls them in reverse order). This patch ensures that the `flush_signal_` handle is no longer in use (i.e. its close callback has already been run) when we signal to the main thread that `~NodeTraceBuffer` may be destroyed. Credit for debugging goes to Gireesh Punathil. Fixes: nodejs#25512
Suggested solution: addaleax/node@60311fe
Stress test CI: https://ci.nodejs.org/job/node-stress-single-test/2146/
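(Roughly, the ordering the commit message above describes looks like the sketch below: close `flush_signal_` first, and only signal the waiting destructor from the very last close callback. The member names follow the classes mentioned earlier, and the assumption that both handles' `data` fields point back at the buffer mirrors the earlier snippet; the authoritative version is the linked commit, not this sketch.)

// Sketch: chain the uv_close() calls so exited_ is only set (and the waiting
// ~NodeTraceBuffer released) after *both* handles have finished closing.
void NodeTraceBuffer::ExitSignalCb(uv_async_t* signal) {
  NodeTraceBuffer* buffer = reinterpret_cast<NodeTraceBuffer*>(signal->data);
  uv_close(reinterpret_cast<uv_handle_t*>(&buffer->flush_signal_),
           [](uv_handle_t* handle) {
    // flush_signal_ is now fully closed; only now close exit_signal_.
    NodeTraceBuffer* buffer = reinterpret_cast<NodeTraceBuffer*>(handle->data);
    uv_close(reinterpret_cast<uv_handle_t*>(&buffer->exit_signal_),
             [](uv_handle_t* handle) {
      // Last close callback: it is finally safe to let the destructor proceed.
      NodeTraceBuffer* buffer = reinterpret_cast<NodeTraceBuffer*>(handle->data);
      Mutex::ScopedLock scoped_lock(buffer->exit_mutex_);
      buffer->exited_ = true;
      buffer->exit_cond_.Signal(scoped_lock);
    });
  });
}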
@addaleax - I will also run the test with your patch, as well as try to understand the changes.
@gireeshpunathil Just occurred to me that this might be somewhat easier to explain with your debug output:
destroying exit signal handle 0x000002B617283090 // enter destructor
...
// uv_close() for flush_signal_
uv_want_endgame, loop: 0x000002B61727F618, handle: 0x000002B617282FB0, type: 1, requester: 103264
// uv_close() for exit_signal_
uv_want_endgame, loop: 0x000002B61727F618, handle: 0x000002B617283090, type: 1, requester: 103264
// uv_close() callback for exit_signal_
uv_process_endgame, loop: 0x000002B61727F618, handle: 0x000002B617283090, type: 1, executor: 103264
/* XXX The destructor continues NOW and memory is released,
* even though flush_signal_ still has a pending callback */
// uv_close() callback for flush_signal_
uv_process_endgame, loop: 0x000002B61727F618, handle: 0x000002B617282FB0, type: 1, executor: 103264
surprise in the end game!
Another data point. From our CI, binaries created on https://ci.nodejs.org/computer/test-rackspace-win2008r2-x64-3/builds reproduce readily. Binaries from other machines, not so much 😕
Strong evidence that #25896 fixes the issue. Good job y'all 🎩
C:\workspace\node-test-binary-windows>python tools\test.py --repeat 10000 parallel/test-trace-events-fs-sync
=== release test-trace-events-fs-sync ===
Path: parallel/test-trace-events-fs-sync
C:\workspace\node-test-binary-windows\test\parallel\test-trace-events-fs-sync.js:140
throw new Error(`${tr}:\n${util.inspect(proc)}`);
^
Error: fs.sync.utimes:
{ status: 3221225477,
signal: null,
output: [ null, '', '' ],
pid: 192,
stdout: '',
stderr: '' }
at Object.<anonymous> (C:\workspace\node-test-binary-windows\test\parallel\test-trace-events-fs-sync.js:140:11)
at Module._compile (internal/modules/cjs/loader.js:735:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:746:10)
at Module.load (internal/modules/cjs/loader.js:627:32)
at tryModuleLoad (internal/modules/cjs/loader.js:570:12)
at Function.Module._load (internal/modules/cjs/loader.js:562:3)
at Function.Module.runMain (internal/modules/cjs/loader.js:798:12)
at internal/main/run_main_module.js:27:11
Command: C:\workspace\node-test-binary-windows\Release\node.exe C:\workspace\node-test-binary-windows\test\parallel\test-trace-events-fs-sync.js
=== release test-trace-events-fs-sync ===
Path: parallel/test-trace-events-fs-sync
C:\workspace\node-test-binary-windows\test\parallel\test-trace-events-fs-sync.js:140
throw new Error(`${tr}:\n${util.inspect(proc)}`);
^
Error: fs.sync.fstat:
{ status: 3221225477,
signal: null,
output: [ null, '', '' ],
pid: 460,
stdout: '',
stderr: '' }
at Object.<anonymous> (C:\workspace\node-test-binary-windows\test\parallel\test-trace-events-fs-sync.js:140:11)
at Module._compile (internal/modules/cjs/loader.js:735:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:746:10)
at Module.load (internal/modules/cjs/loader.js:627:32)
at tryModuleLoad (internal/modules/cjs/loader.js:570:12)
at Function.Module._load (internal/modules/cjs/loader.js:562:3)
at Function.Module.runMain (internal/modules/cjs/loader.js:798:12)
at internal/main/run_main_module.js:27:11
Command: C:\workspace\node-test-binary-windows\Release\node.exe C:\workspace\node-test-binary-windows\test\parallel\test-trace-events-fs-sync.js
[00:27|% 0|+ 7|- 2]: Done
C:\workspace\node-test-binary-windows>python tools\test.py --repeat 10000 parallel/test-trace-events-fs-sync
[15:47|% 3|+ 307|- 0]: Done
C:\workspace\node-test-binary-windows>
New crash with patch from #25896
But then who could these be? Let us find out.
@addaleax - I can confirm that the change is relevant to … With the new patch, the test failed once without enough context, which prompted me to look around for similar patterns. But I could not get enough evidence on this, as the failure suddenly just stopped. Then I followed your suggestions on Linux (adding timing etc.) but did not see any failure. Tried this patch (in theory this should be ok, as we are on the back edge of deleting the object, so cleaning up manually is fine):
index 3b7119f..33e12e4 100644
--- a/src/tracing/node_trace_buffer.cc
+++ b/src/tracing/node_trace_buffer.cc
@@ -114,6 +114,12 @@ NodeTraceBuffer::~NodeTraceBuffer() {
Mutex::ScopedLock scoped_lock(exit_mutex_);
while (!exited_) {
exit_cond_.Wait(scoped_lock);
+ if (exited_) {
+ exit_signal_.type = (uv_handle_type) -1;
+ flush_signal_.type = (uv_handle_type) -1;
+ }
+
}
}
and saw what we used to see on Windows - but with assertion failures on a bad handle type - because I manually cleared it. Then I examined a … So applying the same theory, I can confirm that … In hindsight, do we / should we have an option to direct …?
@gireeshpunathil Yeah, thanks for catching that! I think filling memory with garbage is something we'd only want to do in debug mode -- but maybe we can implement that in some way, yes?
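(Purely as an illustration of that idea, and not an existing Node or libuv option: a debug-only helper that scribbles over memory about to be released, so a stale handle pointer fails loudly instead of reading stale-but-plausible contents.)

#include <cstddef>
#include <cstring>

// Debug-only poisoning: fill soon-to-be-freed storage with a recognisable
// garbage byte (similar in spirit to MSVC's 0xDD "dead memory" fill).
#ifdef DEBUG
inline void PoisonMemory(void* ptr, size_t size) {
  memset(ptr, 0xDD, size);
}
#else
inline void PoisonMemory(void*, size_t) {}
#endif

// A destructor could call PoisonMemory(this, sizeof(*this)) as its last
// statement (only safe when the remaining members are trivially destructible),
// so any later uv_* call on an embedded handle trips an assert right away.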
Libuv does not guarantee that handles have their close callbacks called in the order in which they were added (and in fact, currently calls them in reverse order). This patch ensures that the `flush_signal_` handle is no longer in use (i.e. its close callback has already been run) when we signal to the main thread that `~NodeTraceBuffer` may be destroyed. The same applies for `~NodeTraceWriter`. Credit for debugging goes to Gireesh Punathil. Fixes: #25512 Co-authored-by: Gireesh Punathil <[email protected]> PR-URL: #25896 Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Refael Ackermann <[email protected]> Reviewed-By: Gireesh Punathil <[email protected]> Reviewed-By: Eugene Ostroukhov <[email protected]>
`json_trace_writer_` is protected by `stream_mutex_`, but one access to it was not guarded by a lock on said mutex. Refs: #25512 PR-URL: #25624 Reviewed-By: Denys Otrishko <[email protected]> Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Richard Lau <[email protected]>
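(The shape of that fix, sketched under the assumption of a writer class like the one the commit names; the class, method, and member types here are placeholders - only the locking discipline is the point.)

// Sketch: every access to json_trace_writer_ happens while stream_mutex_ is held.
class NodeTraceWriterLike {
 public:
  void FlushPrivate() {
    Mutex::ScopedLock stream_scoped_lock(stream_mutex_);  // the lock that was
                                                          // missing on one path
    if (json_trace_writer_ != nullptr)
      json_trace_writer_->Flush();                        // guarded access
  }

 private:
  Mutex stream_mutex_;
  TraceWriter* json_trace_writer_ = nullptr;  // assumed v8 TraceWriter-like type
};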
possibly related: nodejs/node#25512
Seeing 3221225477 (access violation) non-stop on win10 + vs2017 on CI in a trace_events test:
Examples:
https://ci.nodejs.org/job/node-test-binary-windows/23086/COMPILED_BY=vs2017,RUNNER=win10,RUN_SUBSET=0/console
https://ci.nodejs.org/job/node-test-binary-windows/23085/COMPILED_BY=vs2017,RUNNER=win10,RUN_SUBSET=0/
https://ci.nodejs.org/job/node-test-binary-windows/23084/COMPILED_BY=vs2017,RUNNER=win10,RUN_SUBSET=0/
...and many others...started happening in the last 24 hours or so. Not sure if something changed in our code or if something changed on CI or what. First noted (to my knowledge) by @gireeshpunathil in #22712 (comment) and #22865 (comment).
That was rebased onto 7f91329. So if the problem is in our code (and not something that is only being surfaced now but has been there for a while or else something that is a problem with the CI host and not a problem with the test or code), then it would be either in that commit or one shortly before it.
@nodejs/trace-events
[refack] Added context - the above 3 failures are on 3 different workers. AFAICT all failures are similar and happen while testing
fchmod
Test call site: https://github.com/nodejs/node/blob/master/test/parallel/test-trace-events-fs-sync.js#L122
and setup site: https://github.com/nodejs/node/blob/master/test/parallel/test-trace-events-fs-sync.js#L33-L36
where this test case is six deep.