test: test-cluster-send-handle-large-payload intermittent failures #14747
Comments
Hm, I ran a stress test on the original PR because it had been flaky before, but that came back green, and a local stress test worked fine for me too, so I assume this is something platform-specific. So, uh, here are two stress tests to narrow that down a bit: OS X: https://ci.nodejs.org/job/node-stress-single-test/1367/ |
I think @bnoordhuis has experienced this too (#14730 (review)). He may have suggestions on reproducing it. For me it happens fairly infrequently (~1 in 20 runs). |
@nodejs/platform-macos … I don’t have access to OS X myself, but I am available to anybody who wants me to help debug my mistakes ;) |
I can replicate this locally. The following command may reproduce it on your local OS as well:
$ tools/test.py -j 96 --repeat 192 test/parallel/test-cluster-send-handle-large-payload.js
=== release test-cluster-send-handle-large-payload ===
Path: parallel/test-cluster-send-handle-large-payload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-cluster-send-handle-large-payload.js
--- TIMEOUT ---
=== release test-cluster-send-handle-large-payload ===
Path: parallel/test-cluster-send-handle-large-payload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-cluster-send-handle-large-payload.js
--- TIMEOUT ---
=== release test-cluster-send-handle-large-payload ===
Path: parallel/test-cluster-send-handle-large-payload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-cluster-send-handle-large-payload.js
--- TIMEOUT ---
=== release test-cluster-send-handle-large-payload ===
Path: parallel/test-cluster-send-handle-large-payload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-cluster-send-handle-large-payload.js
--- TIMEOUT ---
=== release test-cluster-send-handle-large-payload ===
Path: parallel/test-cluster-send-handle-large-payload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-cluster-send-handle-large-payload.js
--- TIMEOUT ---
=== release test-cluster-send-handle-large-payload ===
Path: parallel/test-cluster-send-handle-large-payload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-cluster-send-handle-large-payload.js
--- TIMEOUT ---
=== release test-cluster-send-handle-large-payload ===
Path: parallel/test-cluster-send-handle-large-payload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-cluster-send-handle-large-payload.js
--- TIMEOUT ---
=== release test-cluster-send-handle-large-payload ===
Path: parallel/test-cluster-send-handle-large-payload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-cluster-send-handle-large-payload.js
--- TIMEOUT ---
=== release test-cluster-send-handle-large-payload ===
Path: parallel/test-cluster-send-handle-large-payload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-cluster-send-handle-large-payload.js
--- TIMEOUT ---
=== release test-cluster-send-handle-large-payload ===
Path: parallel/test-cluster-send-handle-large-payload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-cluster-send-handle-large-payload.js
--- TIMEOUT ---
=== release test-cluster-send-handle-large-payload ===
Path: parallel/test-cluster-send-handle-large-payload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-cluster-send-handle-large-payload.js
--- TIMEOUT ---
=== release test-cluster-send-handle-large-payload ===
Path: parallel/test-cluster-send-handle-large-payload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-cluster-send-handle-large-payload.js
--- TIMEOUT ---
=== release test-cluster-send-handle-large-payload ===
Path: parallel/test-cluster-send-handle-large-payload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-cluster-send-handle-large-payload.js
--- TIMEOUT ---
[02:11|% 100|+ 179|- 13]: Done
$
The first 96 tests pass in about 11 seconds, and then the run pauses for two minutes while some of the tests (13 above) time out. Easiest/fastest solution would be to move the test to |
Every time somebody proposes that, I think either the test is broken or Node is broken. ;) I don’t think there’s a reason to assume this test should be flaky under load. |
Also, no, no luck reproducing under Linux, no matter how many parallel jobs. |
Looks like |
@Trott Any chance you could do some digging and figure out whether it’s the parent or the child process that’s hanging? And which handles are keeping it open (assuming it’s stuck with 0 % CPU, not 100 % CPU)? |
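One way to answer the "which handles are keeping it open" question is to have each process report its own open resources shortly before the test would time out. The following is only a sketch: `describeOpenResources` and the 5-second delay are made up for illustration and are not part of the test. It relies on `process.getActiveResourcesInfo()`, which exists only in newer Node.js releases, and falls back to the undocumented internal `process._getActiveHandles()` otherwise.

```javascript
'use strict';

// Hypothetical diagnostic helper, not part of the test under discussion.
// Returns a list of names describing what currently keeps the event loop alive.
function describeOpenResources() {
  if (typeof process.getActiveResourcesInfo === 'function') {
    // Newer Node.js: returns an array of resource type names (strings).
    return process.getActiveResourcesInfo();
  }
  // Older Node.js: undocumented internal API, used here as an assumption.
  return process._getActiveHandles().map((handle) => handle.constructor.name);
}

// Report after a delay. A healthy process has already exited by then;
// a hung one prints its pid and whatever is still open.
const diagnosticTimer = setTimeout(() => {
  console.error(`pid ${process.pid} still alive, open resources:`,
                describeOpenResources());
}, 5000);

// unref() so the diagnostic timer itself never keeps the process alive.
diagnosticTimer.unref();
```

Dropping this into both the parent and the child branch of the test would show, by pid, which side is stuck. Whether CPU usage is 0 % or 100 % still has to be checked separately, for example with `top`.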
@addaleax Yes, looking right now...might have to stop for a few meetings, but will pick up again later if so... |
When this times out, the If I add a callback to Judging from #6767, I don't know why macOS would be more susceptible to missing the message than anything else. If this is not-a-bug behavior, I guess we can add a retry. If this is a bug... ¯\(ツ)/¯ The |
Okay, so this might be fix-able by adding an extra round-trip. I can try that on the weekend if nobody beats me to it. |
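The "extra round-trip" idea can be sketched as a small acknowledgement handshake: the child sends the payload and only lets go of the channel once the parent has confirmed receipt. This is an illustration under assumptions, not the change that was eventually made. The function names, the `type` field, and the in-process fake channel are all hypothetical; in a real test `channel` would be `process` in the child and the cluster `Worker` object in the parent.

```javascript
'use strict';
const { EventEmitter } = require('events');

// Child side: send the payload, then wait for an 'ack' before finishing.
function childSendWithAck(channel, payload, onAcked) {
  channel.on('message', function onMessage(msg) {
    if (msg && msg.type === 'ack') {
      channel.removeListener('message', onMessage);
      onAcked(); // in a real child this is where process.disconnect() could go
    }
  });
  channel.send({ type: 'payload', payload });
}

// Parent side: hand the payload to the caller, then acknowledge it.
function parentAckPayload(channel, onPayload) {
  channel.on('message', (msg) => {
    if (msg && msg.type === 'payload') {
      onPayload(msg.payload);
      channel.send({ type: 'ack' });
    }
  });
}

// In-process stand-in for the IPC channel so the sketch runs without forking.
function makeFakeChannelPair() {
  const a = new EventEmitter();
  const b = new EventEmitter();
  a.send = (msg) => b.emit('message', msg);
  b.send = (msg) => a.emit('message', msg);
  return [a, b];
}

const [childEnd, parentEnd] = makeFakeChannelPair();
const events = [];
parentAckPayload(parentEnd, (p) => events.push(`parent got ${p.length} chars`));
childSendWithAck(childEnd, 'a'.repeat(1024), () => events.push('child acked'));
console.log(events);
```

Note that a handshake like this only helps if the problem is the child going away before the message is flushed. It does not help if the first message is lost outright, because the child would then wait for an acknowledgement that never arrives.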
If the |
@Trott I think the discussion in the issue you linked means that, if the problem actually is “the process exits before process.send() can finish”, then yes, that may be things working as expected. (I’m not convinced that issue shouldn’t be re-opened, but that’s another story.) |
@addaleax Unfortunately, that doesn't appear to be what's going on here. :-( If I keep the event loop open in the child process, the message is still sometimes never received and the test times out. (Interestingly, if I add a useless EDIT: (Yeah, the "I have a fix ready to go" above...not so much.) |
That sounds weird, I don’t think I have heard of that and I’m not sure what that means.
|
Just so I don't forget what it was: Put a |
OK, good news is I have a simple fix. Bad news is that I suspect it covers up a legitimate bug. Not sure, though. Basically, if the child process waits a bit before sending the payload back, the message is always received, and the test passes reliably. That seems like a code smell to me. I thought I was pretty thorough in checking for events I could listen for that might make for a less code-smelly approach, but I'm going to go look again. |
On macOS, the parent process might not receive a message if it is sent too soon, and then subsequent messages are also sometimes not received. (Is this a bug or expected operating system behavior, like the way a file watcher is returned before it's actually watching the file system on macOS?)
Send a second message after a delay on macOS.
While at it, minor refactoring to the test:
* Blank line after loading `common` module per test-writing guide
* Wrap arrow function in braces where implicit return is not needed
* Remove unnecessary unref in subprocess
Fixes: nodejs#14747
Fix (if it is indeed an OS quirk and not a bug) in #14780. |
On macOS, the parent process might not receive a message if it is sent too soon, and then subsequent messages are also sometimes not received. (Is this a bug or expected operating system behavior, like the way a file watcher is returned before it's actually watching the file system on macOS?)
Send a second message after a delay on macOS.
While at it, minor refactoring to the test:
* Blank line after loading `common` module per test-writing guide
* Wrap arrow function in braces where implicit return is not needed
* Remove unnecessary unref in subprocess
PR-URL: #14780
Fixes: #14747
Reviewed-By: Anna Henningsen <[email protected]>
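The "send a second message after a delay on macOS" approach described in the commit message can be sketched roughly as follows. This is not the code from #14780: the helper name, the 1000 ms default, and the injectable `platform` and `schedule` options are assumptions added so the idea can be shown and exercised in isolation.

```javascript
'use strict';

// Hypothetical helper: send once, and on macOS ('darwin') send the same
// message again after a delay in case the first one was not received.
function sendWithDarwinRetry(send, message, options = {}) {
  const {
    platform = process.platform, // injectable for demonstration
    delayMs = 1000,              // assumed value, not taken from the PR
    schedule = setTimeout,       // injectable for demonstration
  } = options;

  send(message);
  if (platform === 'darwin') {
    schedule(() => send(message), delayMs);
  }
}

// Demonstration with a scheduler that runs the callback immediately.
const sentOnDarwin = [];
sendWithDarwinRetry((m) => sentOnDarwin.push(m), 'payload', {
  platform: 'darwin',
  schedule: (fn) => fn(),
});

const sentOnLinux = [];
sendWithDarwinRetry((m) => sentOnLinux.push(m), 'payload', {
  platform: 'linux',
  schedule: (fn) => fn(),
});

console.log(sentOnDarwin.length, sentOnLinux.length);
```

In a real child process, `send` would be something like `(m) => process.send(m)`. The receiving side has to tolerate getting the same message twice, and, as the thread itself points out, a retry like this may be papering over a genuine bug rather than fixing it.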
This test seems to be timing out sometimes. Introduced by #14588.
\cc @addaleax