AIX: Intermittent failure on pseudo-tty/test-tty-wrap #9728
/cc @Fishrock123 |
I am very confused. What? How? |
Oh I see, no output is produced? Oh dear. I guess you've triggered the exit bug with load alone?? ¯\_(ツ)_/¯ |
cc/ @nodejs/platform-aix @Akpotohwo |
Passed 200 runs in stress build. |
Failed 24 times in 3100 on one of our aix machines... It's the same failure as shown earlier. |
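A back-of-the-envelope check on why 200 clean stress runs and 24 failures in 3100 are not contradictory. This is only a sketch: it assumes runs are independent and that the rate seen on one machine applies to the other, neither of which the thread confirms. The function name is made up for illustration.

```python
def clean_run_probability(failures, runs, sample):
    """Chance that `sample` consecutive runs all pass.

    Assumes independent runs with a fixed failure rate of failures / runs.
    """
    rate = failures / runs
    return (1.0 - rate) ** sample


# Numbers quoted in the thread: 24 failures in 3100 runs, 200-run stress build.
rate = 24 / 3100
p_clean = clean_run_probability(24, 3100, 200)
print("failure rate: %.2f%%" % (rate * 100))
print("chance of 200 clean runs: %.0f%%" % (p_clean * 100))
```

Under those assumptions the failure rate is a bit under 1%, and a 200-run stress job still passes cleanly roughly one time in five.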
Good to hear you can recreate. |
Failed again in the build today. https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/2433/ @Akpotohwo any update on the investigation? |
Again today: https://ci.nodejs.org/job/node-test-commit-aix/2736/nodes=aix61-ppc64/console:
length differs.
expect=2
actual=0
patterns:
pattern = ^hello\ world\ 1$
pattern = ^hello\ world\ 2$
outlines:
not ok 1288 pseudo-tty/test-tty-wrap
---
duration_ms: 0.158
severity: fail
stack: |- |
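For readers unfamiliar with this failure message, here is a rough sketch of the kind of comparison that produces "length differs": the expected patterns are matched line by line against the captured output, and the check fails outright if the counts differ. This is not the actual harness code; `compare_output` and its messages are hypothetical, and only the two patterns are taken from the log above.

```python
import re

# Patterns copied from the failure log above.
PATTERNS = [r"^hello\ world\ 1$", r"^hello\ world\ 2$"]


def compare_output(patterns, outlines):
    """Return (failed, reason) for expected regex patterns vs. output lines."""
    if len(patterns) != len(outlines):
        reason = "length differs.\nexpect=%d\nactual=%d" % (
            len(patterns), len(outlines))
        return True, reason
    for pattern, line in zip(patterns, outlines):
        if not re.match(pattern, line):
            return True, "no match: %r vs %r" % (pattern, line)
    return False, ""


# An empty capture reproduces the shape of the AIX failure.
failed, reason = compare_output(PATTERNS, [])
print(failed)
print(reason)
```

With `actual=0`, the parent captured nothing at all, which is why the failure points at lost output rather than wrong output.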
@Fishrock123 (or anyone else that knows): Can you elaborate a bit on "the exit bug"? Is there a link or something? |
Should this be marked as flaky on AIX similar to |
@Trott Given that we have seen it relatively regularly, I plan to go ahead and mark it as flaky. |
@gibfahn I'll submit a PR to exclude it. Can you follow up on where our investigation stands on this one? |
We have had nodejs#9728 open for a while but the frequency of the failures seems to be such that we should mark it as flaky while we continue to investigate.
PR to exclude: https://ci.nodejs.org/job/node-test-pull-request/5702/ |
PR: #10618 |
We have had #9728 open for a while but the frequency of the failures seems to be such that we should mark it as flaky while we continue to investigate. PR-URL: #10618 Reviewed-by: Colin Ihrig <[email protected]> Reviewed-By: Rich Trott <[email protected]> Reviewed-By: Sakthipriyan Vairamani <[email protected]> Reviewed-By: James M Snell <[email protected]>
@Akpotohwo if you're looking at pseudo-tty test failures on AIX it might be worth looking at #7973 as well. |
I investigated further and drew these inferences. There are 2 problems with the pseudo-tty tests:
The test in question (tty-wrap) mostly hits (i) and occasionally (ii).
Root cause for (i):
Passing behavior:
Failing behavior:
By then the child has completely terminated, with its side of the fd closed. A subsequent read by the parent results in EIO, equivalent to a broken pipe.
The root cause of the hang has not been identified, but it can be recreated with a small Python + C test case:
bash-4.3$ cat parent.py
#!/usr/bin/env python
import errno
import os
import pty
from subprocess import Popen, STDOUT
master_fd, slave_fd = pty.openpty()
proc = Popen(['./a.out'],stdout=slave_fd, close_fds=True)
os.close(slave_fd)
data = os.read(master_fd, 512)
print('got ' + repr(data))
bash-4.3$ cat child.cc
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
int main()
{
printf("hello world\n");
exit(0);
}
bash-4.3$
The hang has been raised with the Python community (python 29545), but no one seems to have picked it up yet.
For (i): I see an issue with the test case. It assumes that the parent can read from the fd irrespective of the child's life span and exit status. That always works on Linux, somewhat magically. AIX uses different process scheduling policies than Linux and is affected by this. We have seen many failures in other tests whose cause depends on the order of execution of child vs. parent soon after the fork. A 100 ms delay after the child's write and before its exit consistently resolves the issue. This test and no_dropped_stdio.js (7973) are both affected.
@mhdawson @Fishrock123 @gibfahn - please let me know what you think; if you agree, I will come up with a PR to this effect. |
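To make the proposed workaround concrete, here is a minimal sketch of a child that lingers for 100 ms after writing, with a parent modelled on the parent.py above. The inline Python child is a hypothetical stand-in for the real test script, and `read_once_from_child` is a name made up for illustration; the actual change would go into the affected tests.

```python
import os
import pty
import subprocess
import sys

# Hypothetical stand-in for the test child: write, flush, then linger
# briefly so the parent can read before the slave side is closed.
CHILD_SOURCE = r"""
import sys, time
sys.stdout.write("hello world\n")
sys.stdout.flush()
time.sleep(0.1)  # the 100 ms delay proposed in the thread
"""


def read_once_from_child():
    """Spawn the child on a pty and do a single read from the master side."""
    master_fd, slave_fd = pty.openpty()
    try:
        proc = subprocess.Popen([sys.executable, "-c", CHILD_SOURCE],
                                stdout=slave_fd, close_fds=True)
        os.close(slave_fd)  # parent keeps only the master side
        data = os.read(master_fd, 512)
        proc.wait()
    finally:
        os.close(master_fd)
    return data


if __name__ == "__main__":
    print("got " + repr(read_once_from_child()))
```

The delay only widens the window in which the parent can read; it does not remove the ordering dependency between child exit and parent read.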
@gireeshpunathil if a 100ms delay fixes this consistently on AIX then that seems reasonable to me. |
Is there a way to have the parent signal the child when it has completed its write so that we can avoid a timeout? For example, the parent could write 'done' to the child's stdin once it had read. Delays are always problematic, so I want to be sure we don't have any other way to achieve the same result. |
@mhdawson - while it is technically possible for the parent to acknowledge the data, and to require the child to receive that ack before it exits, the Python parent (testcfg.py) is also the main driver for the rest of the tests in test/pseudo-tty. Furthermore, it delegates the child spawning work to tools/test.py, which is much more generic (it drives the whole test suite). So I guess changing the parent logic to send an ack could have side effects elsewhere. |
So the approach of synchronizing between parent and child has issues:
possibilities:
I am fine either way; suggestions are much appreciated. |
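For comparison with the delay, the acknowledgement idea discussed above could look roughly like this. It is a sketch only: the child is an inline Python stand-in, stdin is an ordinary pipe rather than the pty (an assumption made to keep the example small), and `read_with_ack` is a hypothetical name, not something in testcfg.py or tools/test.py.

```python
import os
import pty
import subprocess
import sys

# Hypothetical child: write, flush, then block until the parent acks.
CHILD_SOURCE = r"""
import sys
sys.stdout.write("hello world\n")
sys.stdout.flush()
sys.stdin.readline()  # wait for the parent's 'done' before exiting
"""


def read_with_ack():
    """Read the child's output from the pty, then tell the child to exit."""
    master_fd, slave_fd = pty.openpty()
    try:
        proc = subprocess.Popen([sys.executable, "-c", CHILD_SOURCE],
                                stdin=subprocess.PIPE, stdout=slave_fd,
                                close_fds=True)
        os.close(slave_fd)
        data = os.read(master_fd, 512)  # child is still alive here
        proc.stdin.write(b"done\n")     # acknowledge the read
        proc.stdin.close()              # flushes the ack to the child
        returncode = proc.wait()
    finally:
        os.close(master_fd)
    return data, returncode


if __name__ == "__main__":
    print(read_with_ack())
```

This avoids a fixed timeout, at the cost of making every child cooperate with the handshake, which is the side-effect concern raised above.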
I'm still +1 on adding a delay, it seems like a vastly simpler option. |
@gireeshpunathil so to clarify, would adding a delay fix the issue for |
This hasn't recurred in some time. Feel free to re-open (or leave a comment requesting that it be re-opened) if you disagree, but I'm inclined to close at this time. I'm just tidying up and not acting on a super-strong opinion or anything like that. |
This is still referenced in
If we close it, shouldn't we also remove the reference to it? Or perhaps change the ref to nodejs/build#1820? Or perhaps, I misunderstand the meaning of |
@sam-github Yes, you are correct: That entry should be removed. #28129 |
The test is believed to no longer be unreliable on AIX. Remove the flaky designation from the appropriate status file. Closes: nodejs#9728 PR-URL: nodejs#28129 Fixes: nodejs#9728 Reviewed-By: Gireesh Punathil <[email protected]> Reviewed-By: Richard Lau <[email protected]> Reviewed-By: Luigi Pinca <[email protected]> Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Ruben Bridgewater <[email protected]> Reviewed-By: Trivikram Kamat <[email protected]> Reviewed-By: James M Snell <[email protected]>
Seen a few intermittent failures in the pseudo-tty/test-tty-wrap test today
https://ci.nodejs.org/job/node-test-commit-aix/1998/nodes=aix61-ppc64/
https://ci.nodejs.org/job/node-test-commit-aix/1988/nodes=aix61-ppc64/