
KeepAlive: adds handling for cases when monitoring may be stopped #578

Merged
merged 5 commits on Jun 7, 2018

Conversation

aoberoi (Contributor) commented Jun 6, 2018

A crash has been reported that looks like this:

Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      | [ERROR] @slack/client:RTMClient A websocket error occurred: read ECONNRESET
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      | Error: no client found
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at KeepAlive.attemptAcknowledgePong (/home/node_modules/@slack/client/dist/KeepAlive.js:92:54)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at RTMClient.emit (/home/node_modules/eventemitter3/index.js:182:35)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at RTMClient.onWebsocketMessage (/home/node_modules/@slack/client/dist/RTMClient.js:447:14)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at WebSocket.onMessage (/home/node_modules/ws/lib/event-target.js:120:16)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at emitOne (events.js:96:13)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at WebSocket.emit (events.js:188:7)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at Receiver._receiver.onmessage (/home/node_modules/ws/lib/websocket.js:137:47)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at Receiver.dataMessage (/home/node_modules/ws/lib/receiver.js:409:14)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at Receiver.getData (/home/node_modules/ws/lib/receiver.js:347:12)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at Receiver.startLoop (/home/node_modules/ws/lib/receiver.js:143:16)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at Receiver.add (/home/node_modules/ws/lib/receiver.js:117:10) 

After tracing how this might occur, one bad sequence of events was found: attemptAcknowledgePong() gets called before the serverPongTimeout fires, but after stop() has already been called and before start() has been called again.

While the client is reconnecting, during the resuming state, incoming messages trigger attemptAcknowledgePong() (it was never removed from the EventEmitter), and then we run into an inconsistent state error. Breaking this into a separate method gives us the ability to remove it from the EventEmitter during the call to stop(); a sketch of the idea follows.
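
As a rough illustration (a minimal sketch with assumed names and event shapes, not the actual @slack/client internals), the key is that the handler becomes a named property that stop() can unsubscribe:

```typescript
import { EventEmitter } from 'events';

// Minimal sketch with simplified names; the real KeepAlive does more.
class KeepAliveSketch {
  private client?: EventEmitter;

  // An arrow-function property keeps `this` bound and gives stop() a
  // stable reference it can pass to removeListener().
  private attemptAcknowledgePong = (event: { reply_to?: number }): void => {
    if (this.client === undefined) {
      // Monitoring was stopped before this event arrived. Previously this
      // path threw "no client found"; returning early is the safe behavior.
      return;
    }
    // ... compare event.reply_to against the last ping id, clear pong timer ...
  };

  public start(client: EventEmitter): void {
    this.client = client;
    // 'slack_event' is an illustrative event name here.
    this.client.on('slack_event', this.attemptAcknowledgePong);
  }

  public stop(): void {
    // Because the handler is a named property rather than an inline
    // closure, it can be detached when monitoring stops.
    if (this.client !== undefined) {
      this.client.removeListener('slack_event', this.attemptAcknowledgePong);
    }
    this.client = undefined;
  }
}
```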

After that issue was found, it became apparent that every other place where we throw an inconsistent state error may similarly just be a situation where monitoring was turned off before that callback fired. This PR also adds a check of isMonitoring in all of those situations, because it's usually okay and just means we should return early (see the sketch below).
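
A minimal sketch of that guard, again with illustrative names rather than the exact library code:

```typescript
// Sketch of the isMonitoring early-return guard; names are illustrative.
class MonitorSketch {
  private isMonitoring = false;
  private pongTimer?: NodeJS.Timeout;

  public start(): void {
    this.isMonitoring = true;
    this.pongTimer = setTimeout(() => this.pongTimeoutExpired(), 5000);
  }

  public stop(): void {
    this.isMonitoring = false;
    if (this.pongTimer !== undefined) {
      clearTimeout(this.pongTimer);
    }
  }

  private pongTimeoutExpired(): void {
    if (!this.isMonitoring) {
      // The timer raced with stop(): monitoring was turned off before this
      // callback fired. That's expected, not an inconsistent state, so
      // return early instead of throwing.
      return;
    }
    // ... otherwise recommend a reconnect to the client ...
  }
}
```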



codecov bot commented Jun 6, 2018

Codecov Report

Merging #578 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #578   +/-   ##
=======================================
  Coverage   86.78%   86.78%           
=======================================
  Files           6        6           
  Lines         280      280           
  Branches       43       43           
=======================================
  Hits          243      243           
  Misses         26       26           
  Partials       11       11
Impacted Files   Coverage Δ
src/errors.ts    100% <ø> (ø) ⬆️
src/logger.ts    87.87% <ø> (ø) ⬆️


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 008402b...265aa97.


aoberoi commented Jun 6, 2018

i disconnected my network connection to observe what happens. i don't see a crash, but i do see some odd behavior:

[DEBUG] @slack/client:KeepAlive received pong, clearing pong timer
[DEBUG] @slack/client:KeepAlive ping timer expired, sending ping
[DEBUG] @slack/client:RTMClient send() in state: connected,ready
[DEBUG] @slack/client:KeepAlive setting ping timer
[DEBUG] @slack/client:RTMClient sending message on websocket: {"type":"ping","id":48}
[DEBUG] @slack/client:KeepAlive setting pong timer

--- disconnect network connection ---

[DEBUG] @slack/client:KeepAlive pong timer expired, recommend reconnect
[DEBUG] @slack/client:KeepAlive ping timer expired, sending ping
[DEBUG] @slack/client:RTMClient send() in state: connected,ready
[DEBUG] @slack/client:RTMClient sending message on websocket: {"type":"ping","id":49}
[ERROR] @slack/client:RTMClient failed to send message on websocket: WebSocket is not open: readyState 2 (CLOSING)
[ERROR] @slack/client:KeepAlive Unhandled error: Failed to send message on websocket: WebSocket is not open: readyState 2 (CLOSING). Please report to @slack/client package maintainers.

--- reconnect network connection ---

[DEBUG] @slack/client:RTMClient received message on websocket: {"type":"pong","reply_to":48}
[DEBUG] @slack/client:RTMClient transitioning to state: reconnecting
[DEBUG] @slack/client:RTMClient transitioning to state: connecting

immediately after the 'recommend reconnect', i'd expect to see the RTMClient transition to reconnecting, but instead it stays in the connected,ready state. continuing to investigate.


aoberoi commented Jun 6, 2018

currently, the RTMClient depends on the behavior that calling websocket.close() results in the websocket object emitting a 'close' event. however, this doesn't appear to happen here.

i'm not sure whether this behavior is correct to expect. i can't find much in the spec about this, but from my read of the following test code, i think it should work:

https://github.com/websockets/ws/blob/bf9b2ececbe42dd07ef9619d2b4953f57243c843/test/websocket.test.js#L241-L258

Update: actually, in the above test, it's the server that initiates the closing of the websocket. according to the protocol specification, clients "SHOULD NOT" close the connection (but still may): https://tools.ietf.org/html/rfc6455#section-7.3. however, according to the DOM API, the close event should be fired: https://html.spec.whatwg.org/multipage/web-sockets.html#closeWebSocket

Update 2: i think it is expected behavior: https://github.com/websockets/ws/blob/bf9b2ececbe42dd07ef9619d2b4953f57243c843/test/websocket.test.js#L1194-L1210. there's something to debug here.
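
as a sketch of what i'd expect to work (assuming a hypothetical local echo server; this isn't code from the PR):

```typescript
// minimal sketch of the expected behavior: a client-initiated close()
// should still fire the 'close' event once the close handshake completes.
// the URL points at a hypothetical local server.
import WebSocket from 'ws';

const ws = new WebSocket('ws://localhost:8080');
ws.on('open', () => {
  ws.close(1000); // client starts the close handshake
});
ws.on('close', (code: number) => {
  console.log(`close event fired with code ${code}`); // expect 1000
});
```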


aoberoi commented Jun 6, 2018

i think i figured it out! when you call websocket.close(), that initiates the close handshake, but the 'close' event can't fire until a close frame is actually sent on the wire. if the connection is truly down, that's not going to happen (at least not anytime soon). therefore, stopping the KeepAlive monitoring only after the 'close' event fires leaves plenty of time to send new pings that serve no purpose. the last commit in this PR preemptively stops the monitoring so that there aren't issues sending these futile pings. we've already committed to tearing down the websocket connection, so a keep-alive process doesn't make any sense.
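
roughly, the ordering looks like this (a sketch with assumed shapes, not the exact diff in this PR):

```typescript
import WebSocket from 'ws';

// sketch with assumed shapes: stop keep-alive before starting the close
// handshake, since 'close' may never fire (or fire very late) on a dead
// connection.
function teardownConnection(websocket: WebSocket, keepAlive: { stop(): void }): void {
  // we've already committed to tearing the connection down, so any further
  // pings are futile and would fail while readyState is CLOSING.
  keepAlive.stop();

  websocket.close();

  // defensive fallback (illustrative, not part of this PR): if the close
  // handshake can't complete because the network is down, force the socket
  // closed after a timeout.
  const fallback = setTimeout(() => websocket.terminate(), 4000);
  websocket.once('close', () => clearTimeout(fallback));
}
```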

aoberoi requested a review from shaydewael on June 6, 2018 07:03
Review thread on KeepAlive.start():

       */
      public start(client: RTMClient): void {
        this.logger.debug('start monitoring');
Contributor:

Should we make these a little less vague so they make more sense in the context of all the other debug logging an app will be doing?

aoberoi (Author):

i think this is okay since each log line is prepended with the object name (e.g. KeepAlive), and in that context, monitoring only really has one meaning.

aoberoi merged commit 3c8789a into slackapi:master on Jun 7, 2018
aoberoi deleted the keepalive-inconsistency-issues branch on June 11, 2018 16:43