Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: refactor flaky net-error-twice #4062

Closed
wants to merge 2 commits into from

Conversation

Trott
Copy link
Member

@Trott Trott commented Nov 29, 2015

The test was not reliably creating the error event on Windows 2012.

This makes the test more robust by using setTimeout() to delay the
server writing to the socket to give time for socket.destroy() to take
effect.

Fixes: #4057

@Trott
Copy link
Member Author

Trott commented Nov 29, 2015

The test was not reliably creating the error event on Windows 2012.

This makes the test more robust by using setImmediate() to delay the
server writing to the socket to give time for socket.destroy() to take
effect.

Fixes: nodejs#4057
@Trott
Copy link
Member Author

Trott commented Nov 29, 2015

@Trott
Copy link
Member Author

Trott commented Nov 29, 2015

The stress test is going OK, but the CI still shows a failure for this test. I'd hate to have to delay the write() with a setTimeout() but that may be the workaround at least in the short term...

@mscdex mscdex added net Issues and PRs related to the net subsystem. test Issues and PRs related to the tests. labels Nov 29, 2015
@Trott
Copy link
Member Author

Trott commented Nov 29, 2015

// Windows-specific handling. See:
// * https://github.com/nodejs/node/pull/4062
// * https://github.com/nodejs/node/issues/4057
setTimeout(() => { conn.write(buf); }, 100);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the delay necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bnoordhuis Concise-but-naive answer: Without the delay, the test is flaky on Windows 2012 r2 on the CI infrastructure.

Less concise and marginally less naive: In this test, conn.write(buf) is supposed to get an error (either EPIPE or ECONNRESET depending on timing). Unfortunately, on Windows 2012 r2 only, it appears that the call to client.destroy() often returns before the socket is set to emit the error. UPDATE: Actually, I think the problem is that the client connection and server connection callbacks fire in an unpredictable order. On Windows 2012 r2 only, the conn.write() would often not trigger an error and the result was assert.equal(errs.length, 1) firing (because errs.length is 0). I tried setImmediate() and that may have improved things but it did not resolve the flakiness. Using the timeout resolved the flakiness. (I did not experiment with a timeout of less than 100 milliseconds.)

As for the cause of this on Windows 2012 r2, I don't know and would certainly be receptive to some guidance on figuring it out (or someone simply doing the work and identifying the source of the issue). Operating system limitation? Bug in libuv? Bug in socket.destroy? I don't know. I don't have easy access to a Windows 2012 r2 machine. (I could try to run one via VirtualBox, but my laptop resources are kind of limited right now and my weak excuse is that it would probably fill up my disk.)

@Trott
Copy link
Member Author

Trott commented Dec 4, 2015

Bump. Would like to get this fix in because this test has been failing a lot lately.

@Trott
Copy link
Member Author

Trott commented Dec 6, 2015

Bump again, but this time with an @ mention: @nodejs/testing

This continues to fail a lot on Windows and I'd like to get this in as a fix. If this is not going to land, then a different PR can be submitted to at least mark the test as flaky. But I'd rather just see this land...

I suppose the best path would be to identify why this is happening on Windows. As I mention in a previous comment, I don't know if this is an OS issue, libuv issue, socket.destroy() issue, or what.

@bnoordhuis bnoordhuis added the windows Issues and PRs related to the Windows platform. label Dec 7, 2015
@bnoordhuis
Copy link
Member

@nodejs/platform-windows Can you investigate why the timeout fixes the test on Windows?

@joaocgreis
Copy link
Member

I haven't yet been able to reproduce locally. I have only seen this fail on msft machines, has it failed in a rackspace machine yet?

@joaocgreis
Copy link
Member

This reproduces every time in a fast single-core VM. Slowing down the VM (CPU cap) solves the problem. Only on Windows 2012. This might be a problem with node and not the test.

@Trott
Copy link
Member Author

Trott commented Dec 9, 2015

Maybe I was overthinking this. Is it possible that all that's happening is that sometimes the client connection callback is firing before the server connection callback and sometimes vice versa? So sometimes the socket is destroyed (which is what the test expects) and sometimes not?

It seems that all we'd have to do to fix this is wrap the conn.write() in a process.nextTick()... which seems to be working in the stress test. I thought I tried that and/or setImmediate() and it didn't quite work, but maybe I'm mistaken? Or maybe the that's working due to side effects of console logging I added. Let me re-try the stress test without the logging....

@joaocgreis If we change the setTimeout() to a process.nextTick(), does it reliably pass on the Windows machine where it had been reliably failing?

@Trott
Copy link
Member Author

Trott commented Dec 9, 2015

Alas, process.nextTick() really doesn't fix it so I really am not losing my mind (although I am filling up the issue tracker with needless walls of text. Sorry...).

@Trott
Copy link
Member Author

Trott commented Dec 9, 2015

OK, so back to the "why the timeout" question:

  • Is it plausible that all that is happening is that the client connection and server connection callbacks fire in an unpredictable order?
  • If so, then would it be acceptable just to use the setTimeout() (or, if we want to really be safe, a setInterval()) around the conn.write()?
  • If not, then any other ideas about what we're seeing here?

@Trott
Copy link
Member Author

Trott commented Dec 9, 2015

I now have access to a Windows machine and can iterate/test more efficiently. Ordering of callbacks is not the issue. Bummer. Starting to look/feel more and more like an issue with Node itself, as @joaocgreis suggested...

@Trott
Copy link
Member Author

Trott commented Dec 9, 2015

Even just introducing a single console.log() statement causes the test to start passing. In case that information helps someone figure this out so I don't have to...

@Trott
Copy link
Member Author

Trott commented Dec 9, 2015

Also interesting: If I remove require('common') (and hardcode the instances of common.PORT to 12346), then the test always passes!

@Trott
Copy link
Member Author

Trott commented Dec 10, 2015

This is getting weird: It appears that the culprit is adding the console global to the knownGlobals array in common.js on line 260. Comment out that line and set exports.globalCheck to false in line 332, and the test passes every time. (Changing exports.globalCheck alone is not enough. You need to get rid of console from the knownGlobals array.)

Any ideas as to why adding the console global variable to a private array inside the common module would be causing this issue? I understand how it could have side effects. I'm just not coming up with a plausible mechanism of how those side effects translate into this specific failure.

cc @bengl since I roped him in on this in between sessions at Node Interactive...

@mscdex
Copy link
Contributor

mscdex commented Dec 18, 2015

I think I found a solution that doesn't rely on a timeout, in #4342. Basically it waits until the server connection handler fires to destroy the client socket and write to the server socket. Doing it this way results in a single ECONNRESET every time on win2012r2.

@Trott
Copy link
Member Author

Trott commented Dec 18, 2015

I'd still like to know why adding console to an array in common.js causes this problem. That's just totally weird (and it was strangely gratifying to have João more-or-less confirm it independently).

@Trott
Copy link
Member Author

Trott commented Dec 19, 2015

Closing. @mscdex resolved it in #4342 and I opened #4358 to address the strange behavior that their fix works around.

@MylesBorins
Copy link
Contributor

removing lts watch tag as nothing landed in here

@Trott Trott deleted the net-error-twice branch January 13, 2022 22:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
net Issues and PRs related to the net subsystem. test Issues and PRs related to the tests. windows Issues and PRs related to the Windows platform.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants