Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: fix test-inspector-port-zero-cluster #13373

Merged
merged 1 commit into from
Jun 16, 2017

Conversation

refack
Copy link
Contributor

@refack refack commented Jun 1, 2017

Fixes: #13343

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines
Affected core subsystem(s)

test,cluster,inspector

@nodejs-github-bot nodejs-github-bot added dont-land-on-v4.x inspector Issues and PRs related to the V8 inspector protocol test Issues and PRs related to the tests. labels Jun 1, 2017
@Trott
Copy link
Member

Trott commented Jun 1, 2017

Does this fix a bug in the test or is this more of a workaround for a bug in Node.js? Seems like the latter to me, but maybe I'm wrong?

@refack
Copy link
Contributor Author

refack commented Jun 1, 2017

Stress (fedora24 / -j 64 inspector X 100) https://ci.nodejs.org/job/node-stress-single-test/1248/nodes=fedora24/

}
process.on('exit', () => {
ports.sort();
serialFork().then(serialFork).then(serialFork).then(common.mustCall(() => {
assert.strictEqual(ports.length, 3);
assert(ports.every((port) => port > 0));
assert(ports.every((port) => port < 65536));
assert.strictEqual(ports[0] + 1, ports[1]); // Ports should be consecutive.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're in here fixing up the test, I wonder if we want to change this? It's not really guaranteed that the ports are sequential, is it? If nothing else, won't it wrap around to lower port numbers after 65535? I wonder if we want to change this to a check that we have three unique port numbers rather than sequential port numbers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are supposed to be sequential #13343 (comment)
It's a bug/feature of cluster https://github.com/nodejs/node/blob/master/lib/internal/cluster/master.js#L113

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooof, thanks for the pointer.

Copy link
Contributor Author

@refack refack Jun 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test might be more fragile since with --inspect=0 we ask for ports from the ephemeral range, and with --inspect=3993 it's a less used range.

@Trott
Copy link
Member

Trott commented Jun 1, 2017

The CI output in #13343 does not show two workers grabbing the same port so I'm not sure this is going to solve the issue. Maybe the worker that fails is conflicting with a pre-existing process on the host? In which case, I'm not sure there's anything we can do about it in the test itself?

} else {
process.on('message', (message) => {
process.on('message', common.mustCall((message) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

common.mustCall() failing in a worker won't do anything, so I wouldn't include it here because it gives the reader a false sense that it's an effective check. (You can test this by passing common.mustCall(fn, 1000) or something like that.)

@refack
Copy link
Contributor Author

refack commented Jun 1, 2017

Does this fix a bug in the test or is this more of a workaround for a bug in Node.js? Seems like the latter to me, but maybe I'm wrong?

I think it narrows down the test-case, to make sure the ports are assigned and are sequential.
Maybe the code before [fork(), fork(), fork()] is more "real world" 🤔

The CI output in #13343 does not show two workers grabbing the same port so I'm not sure this is going to solve the issue. Maybe the worker that fails is conflicting with a pre-existing process on the host? In which case, I'm not sure there's anything we can do about it in the test itself?

Mark as flaky? Add a known issue (create a server listening to process.debugPort + 1 before a fork())

} else {
process.on('message', (message) => {
process.on('message', common.mustCall((message) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

common.mustCall() failing in a worker won't do anything, so I wouldn't include it here because it gives the reader a false sense that it's an effective check. (You can test this by passing common.mustCall(fn, 1000) or something like that.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked at in again and added cluster.on('exit') with assertions that the worker exited properly.

@Trott
Copy link
Member

Trott commented Jun 1, 2017

Maybe an OK workaround would be to check to see if the fork() failed and, if so, fork() again (perhaps checking that it failed because the address was in use before forking again)?

I'm +1 on creating a known_issues test while we're at it (if one doesn't already exist). And include a comment in this test indicating that the check-for-error-and-fork-again workaround can be removed once that known issue is fixed?

(Thanks for tackling this, by the way. I'd probably be going down a completely wrong rabbit hole.)

@refack
Copy link
Contributor Author

refack commented Jun 1, 2017

Maybe an OK workaround would be to check to see if the fork() failed

Yes, makes sense.

(Thanks for tackling this, by the way. I'd probably be going down a completely wrong rabbit hole.)

I was in the neighborhood with #12941

@refack
Copy link
Contributor Author

refack commented Jun 1, 2017

@Trott I did a mix-and-match

  1. forking is back to non-deterministic Promise.all()
  2. bind process.debugPort + 2 so worker2 should fail
  3. added a ton of assertions

@refack
Copy link
Contributor Author

refack commented Jun 1, 2017

P.S. test is still flaky unless we allow for more than 1 fail
...
Ohh, I have an idea

@Trott
Copy link
Member

Trott commented Jun 1, 2017

I don't feel so strongly about this that I'd stop this from landing, but: I think blocking one of the ports with a listening socket intentionally kind of changes what this test is about. This is happy-path testing and if we want to check for problems, we should probably set up a different test.

@Trott
Copy link
Member

Trott commented Jun 1, 2017

@nodejs/testing

@Trott
Copy link
Member

Trott commented Jun 1, 2017

I don't feel so strongly about this that I'd stop this from landing, but: I think blocking one of the ports with a listening socket intentionally kind of changes what this test is about. This is happy-path testing and if we want to check for problems, we should probably set up a different test.

Oops, never mind! It is an additional test case. Ignore me. Sorry for the noise.

@refack
Copy link
Contributor Author

refack commented Jun 1, 2017

Oops, never mind! It is an additional test case. Ignore me. Sorry for the noise.

Accept for asserting current behavior, I'm also not sure what this test is looking for.
I think I'll split it for testing proper worker setup (not actual connection), and the port clash as a known issue.

@refack refack changed the title test: serialize forking test: fix test-inspector-port-zero-cluster Jun 3, 2017
Copy link
Member

@bnoordhuis bnoordhuis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #13343 (comment), if my hypothesis is correct, I don't think this PR will materially fix that while rather obscuring the actual feature under test.

cluster.on('online', common.mustCall((worker) => worker.send('debugPort'), 2));
cluster.on('error', (err) => assert.fail(err));
cluster.on('exit', (worker, code, signal) => {
assert.ok(!signal);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert.strictEqual(signal, null)?

assert.strictEqual(ports[1] + 1, ports[2]);
const oneFailSentinal = common.mustCall();
cluster.on('online', common.mustCall((worker) => worker.send('debugPort'), 2));
cluster.on('error', (err) => assert.fail(err));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be shortened to cluster.on('error', assert.fail) but is this event actually emitted? If yes, the documentation in cluster.md is incomplete.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

} else {
const sentinal = common.mustCall();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: sentinel (ditto a few lines up.)

if (message !== 'debugPort') return;
process.send({ debugPort: process.debugPort });
sentinal();
process.exit(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't call process.exit(), call process.disconnect() and let the worker exit naturally.

@refack
Copy link
Contributor Author

refack commented Jun 3, 2017

See #13343 (comment), if my hypothesis is correct, I don't think this PR will materially fix that while rather obscuring the actual feature under test.

Ack. Re: #13373 (comment)
I think I'm going to turn this into a failing known_issue and leave it at that.

@refack refack force-pushed the fix-test-inspector-port-zero-cluster branch from 2b44385 to 90dadd9 Compare June 3, 2017 13:19
@refack
Copy link
Contributor Author

refack commented Jun 3, 2017

@refack
Copy link
Contributor Author

refack commented Jun 3, 2017

Gate is green, Full CI: https://ci.nodejs.org/job/node-test-pull-request/8458/

common.skipIfInspectorDisabled();
// Assert that even when started with `--inspect=0` workers are assigned
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix comment after CI finished

@refack
Copy link
Contributor Author

refack commented Jun 3, 2017

@bnoordhuis @Trott @jasnell @nodejs/testing I've split this in two:

  1. RFC: Original test would skip if detects a port clash. Not sure if it's better than marking as Flaky?
  2. new known_issue to demonstrate a deterministic port clash
    PTAL?

@Trott
Copy link
Member

Trott commented Jun 3, 2017

I think @bnoordhuis is saying that the behavior is not a bug but rather expected behavior. If so a known_issue test seems inappropriate. If the port must be 12345 and 12345 is in use by another process, there's nothing Node.js can do about that. Or am I missing something?

@refack
Copy link
Contributor Author

refack commented Jun 11, 2017

So as I see it:

  1. test/inspector/test-inspector-port-zero-cluster.js - works as designed
  2. test/known_issues/test-inspector-cluster-port-clash.js - known limitation, need to find a way to let user opt out
  3. test/known_issues/test-inspector-port-zero-cluster.js - just a bug that Fixed cluster inspect port logic #13619 fixes

I agree that (2) is an edge case (the real issue is #12941), and (3) has a fix so if you want I could kick them out.

@Trott
Copy link
Member

Trott commented Jun 11, 2017

I agree that (2) is an edge case (the real issue is #12941), and (3) has a fix so if you want I could kick them out.

If there's consensus that we really do want to provide a way for users to opt out, then my preference for 2 would be that we keep it after all but include a comment that explains something along the lines of "known limitation, currently working as intended, but we need a way to allow the user to opt out of sequential port assignment".

If another PR is going to fix the issue in #3, then the test can go right into that PR, but I don't object to it being here instead.


// There is an issue when forking workers that try to bind a port for inspector.
// If the port is not free the worker will fail.
// This test should fail with:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this comment be clarified a bit? This makes it sound like "If Node.js were working correctly and also addressing this issue, then this test should fail." Maybe instead of "should", something like "This test currently fails with:" or maybe something even more verbose like:

With the current behavior of Node.js (at least as late as 8.1.0), this test fails with the following error:
AssertionError [ERR_ASSERTION]: worker 2 failed to bind port
Ideally, there would be a way for the user to opt out of sequential port assignment and this test would then pass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@refack
Copy link
Contributor Author

refack commented Jun 11, 2017

@refack refack mentioned this pull request Jun 13, 2017
3 tasks
@mutantcornholio
Copy link
Contributor

I can't say that it would fix this specific test flakiness, but what if testpy would check if all ports in the range of next common.PORT are free, and if not, use another range? Seems like a better solution to me.

This approach still has some probability of race-conditions: even if we would guarantee that [checking the port and starting the test] would be atomic, tests will not start to listen on the port right away. But if the port is taken by some system or long-enough-running process, it could do the trick.

Separating tests that use zero port and tests that use common.PORT should lower the probability of this, from the other side.

My opinion is we need to find a way to provide a stable infrastructure to the tests, not to rewrite the tests to the point they would be aware of the problem.

@Trott
Copy link
Member

Trott commented Jun 14, 2017

I can't say that it would fix this specific test flakiness, but what if testpy would check if all ports in the range of next common.PORT are free, and if not, use another range? Seems like a better solution to me.

We will never be able to completely guarantee that a specified port is available for a test before the test actually tries to use it.

So the task at hand is to determine a Good Enough solution.

I think the solution we have now is adequate and adding more layers of engineering will likely make things worse, not better.

Solution now is: If you are using common.PORT, put the test in sequential.

If someone slips up and puts the test in parallel, it will still work nearly 100% of the time. We'll probably get a failure once a month or less in CI, at which point we'll ignore it or realize what is going on and move it to sequential.

@bnoordhuis bnoordhuis dismissed their stale review June 15, 2017 10:00

PR changed considerably after last review

@mutantcornholio
Copy link
Contributor

mutantcornholio commented Jun 15, 2017

Wait, shouldn't we disable port incrementing behavior in port 0?

If we're letting OS decide which port to allocate, incrementing after that port is not a good idea IMHO. User have no control on that port range => it will lead to port collisions.

If master started with --inspect=0, let's start workers with --inspect=0 too.

@refack
Copy link
Contributor Author

refack commented Jun 15, 2017

Wait, shouldn't we disable port incrementing behavior in port 0?

If we're letting OS decide which port to allocate, incrementing after that port is not a good idea IMHO. User have no control on that port range => it will lead to port collisions.

If master started with --inspect=0, let's start workers with --inspect=0 too.

It was discussed and for now we believe that deterministic worker port allocation makes more sense. #13343 (comment)

Hopefully your second PR (manual override) will cover the other use cases.

[edit]
As for port collisions, the user should be aware that worker forking might fail, and retry anyway...

@mutantcornholio
Copy link
Contributor

Hopefully your second PR (manual override) will cover the other use cases.

Yeah, I think, users will be able to start master with --inspect=0 and use cluster.settings to set--inspect=0 to workers.

As for port collisions, the user should be aware that worker forking might fail, and retry anyway...

I thought that whole idea of port=0 is "OS, all I want is to avoid port collision".
But if you're guys sure this is the right way, sorry to bother :)

@refack
Copy link
Contributor Author

refack commented Jun 16, 2017

* re-implemented test to parse args instead of post binding (exit 12)
* saved failing case as known issue

PR-URL: nodejs#13373
Fixes: nodejs#13343
Reviewed-By: James M Snell <[email protected]>
@refack refack force-pushed the fix-test-inspector-port-zero-cluster branch from 81bc47f to fe2caf6 Compare June 16, 2017 11:14
@refack
Copy link
Contributor Author

refack commented Jun 16, 2017

test-inspector-port-zero-cluster.js removed since bug was fixed in 2777a7e
Pre land CI: https://ci.nodejs.org/job/node-test-commit/10603/

@refack refack merged commit fe2caf6 into nodejs:master Jun 16, 2017
@refack
Copy link
Contributor Author

refack commented Jun 16, 2017

Extra stress on master (30 cycles for all suites) https://ci.nodejs.org/job/node-stress-single-test/1306/nodes=rhel72-s390x/ ✔️

@refack refack deleted the fix-test-inspector-port-zero-cluster branch June 16, 2017 16:19
@refack
Copy link
Contributor Author

refack commented Jun 16, 2017

But if you're guys sure this is the right way, sorry to bother :)

@mutantcornholio IMHO your input is as valuable as anyone else's 👍
"Sure" 🤣 Me definitely not "sure", but it was given some thought, and currently that's the side of the tradeoff we estimate will make most sense.

addaleax pushed a commit that referenced this pull request Jun 17, 2017
* re-implemented test to parse args instead of post binding (exit 12)
* saved failing case as known issue

PR-URL: #13373
Fixes: #13343
Reviewed-By: James M Snell <[email protected]>
@addaleax addaleax mentioned this pull request Jun 17, 2017
addaleax pushed a commit that referenced this pull request Jun 21, 2017
* re-implemented test to parse args instead of post binding (exit 12)
* saved failing case as known issue

PR-URL: #13373
Fixes: #13343
Reviewed-By: James M Snell <[email protected]>
@addaleax addaleax mentioned this pull request Jun 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
inspector Issues and PRs related to the V8 inspector protocol test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Investigate flaky test-inspector-port-zero-cluster
7 participants