
KeepAlive: adds handling for cases when monitoring may be stopped #578

Merged
merged 5 commits on Jun 7, 2018

Conversation

aoberoi (Contributor) commented Jun 6, 2018

A crash has been reported that looks like this:

Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      | [ERROR] @slack/client:RTMClient A websocket error occurred: read ECONNRESET
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      | Error: no client found
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at KeepAlive.attemptAcknowledgePong (/home/node_modules/@slack/client/dist/KeepAlive.js:92:54)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at RTMClient.emit (/home/node_modules/eventemitter3/index.js:182:35)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at RTMClient.onWebsocketMessage (/home/node_modules/@slack/client/dist/RTMClient.js:447:14)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at WebSocket.onMessage (/home/node_modules/ws/lib/event-target.js:120:16)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at emitOne (events.js:96:13)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at WebSocket.emit (events.js:188:7)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at Receiver._receiver.onmessage (/home/node_modules/ws/lib/websocket.js:137:47)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at Receiver.dataMessage (/home/node_modules/ws/lib/receiver.js:409:14)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at Receiver.getData (/home/node_modules/ws/lib/receiver.js:347:12)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at Receiver.startLoop (/home/node_modules/ws/lib/receiver.js:143:16)
Jun 05 17:02:45 534971b042ee ecs-rtmworker-20-rtmworker-b8d9f1e6eec3b390f201: 00:02:45 1|rtm      |     at Receiver.add (/home/node_modules/ws/lib/receiver.js:117:10) 

After tracing how this might occur, one bad sequence of events was found: attemptAcknowledgePong() gets called before the serverPongTimeout fires, but after stop() has already been called and before start() has been called again.

While the client is reconnecting, during the resuming state, incoming messages trigger attemptAcknowledgePong() (it was never removed from the EventEmitter), and then we run into an inconsistent state error. Breaking this into a separate method gives us the ability to remove it from the EventEmitter during the call to stop(); a sketch of the idea follows.
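
As a rough illustration (a minimal sketch with assumed names and event shapes, not the actual @slack/client internals), the key is that the handler becomes a named property that stop() can unsubscribe:

```typescript
import { EventEmitter } from 'events';

// Minimal sketch with simplified names; the real KeepAlive does more.
class KeepAliveSketch {
  private client?: EventEmitter;

  // An arrow-function property keeps `this` bound and gives stop() a
  // stable reference it can pass to removeListener().
  private attemptAcknowledgePong = (event: { reply_to?: number }): void => {
    if (this.client === undefined) {
      // Monitoring was stopped before this event arrived. Previously this
      // path threw "no client found"; returning early is the safe behavior.
      return;
    }
    // ... compare event.reply_to against the last ping id, clear pong timer ...
  };

  public start(client: EventEmitter): void {
    this.client = client;
    // 'slack_event' is an illustrative event name here.
    this.client.on('slack_event', this.attemptAcknowledgePong);
  }

  public stop(): void {
    // Because the handler is a named property rather than an inline
    // closure, it can be detached when monitoring stops.
    if (this.client !== undefined) {
      this.client.removeListener('slack_event', this.attemptAcknowledgePong);
    }
    this.client = undefined;
  }
}
```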

After that issue was found, it became apparent that every other place where we throw an inconsistent state error may similarly just be a situation where monitoring was turned off before that callback fired. This PR also adds a check of isMonitoring in all of those situations, because it's usually okay and just means we should return early (see the sketch below).
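
A minimal sketch of that guard, again with illustrative names rather than the exact library code:

```typescript
// Sketch of the isMonitoring early-return guard; names are illustrative.
class MonitorSketch {
  private isMonitoring = false;
  private pongTimer?: NodeJS.Timeout;

  public start(): void {
    this.isMonitoring = true;
    this.pongTimer = setTimeout(() => this.pongTimeoutExpired(), 5000);
  }

  public stop(): void {
    this.isMonitoring = false;
    if (this.pongTimer !== undefined) {
      clearTimeout(this.pongTimer);
    }
  }

  private pongTimeoutExpired(): void {
    if (!this.isMonitoring) {
      // The timer raced with stop(): monitoring was turned off before this
      // callback fired. That's expected, not an inconsistent state, so
      // return early instead of throwing.
      return;
    }
    // ... otherwise recommend a reconnect to the client ...
  }
}
```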



codecov bot commented Jun 6, 2018

Codecov Report

Merging #578 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #578   +/-   ##
=======================================
  Coverage   86.78%   86.78%           
=======================================
  Files           6        6           
  Lines         280      280           
  Branches       43       43           
=======================================
  Hits          243      243           
  Misses         26       26           
  Partials       11       11
Impacted Files   Coverage Δ
src/errors.ts    100% <ø> (ø) ⬆️
src/logger.ts    87.87% <ø> (ø) ⬆️


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 008402b...265aa97.


aoberoi commented Jun 6, 2018

i disconnected my network connection to observe what happens. i don't see a crash, but i do see some odd behavior:

[DEBUG] @slack/client:KeepAlive received pong, clearing pong timer
[DEBUG] @slack/client:KeepAlive ping timer expired, sending ping
[DEBUG] @slack/client:RTMClient send() in state: connected,ready
[DEBUG] @slack/client:KeepAlive setting ping timer
[DEBUG] @slack/client:RTMClient sending message on websocket: {"type":"ping","id":48}
[DEBUG] @slack/client:KeepAlive setting pong timer

--- disconnect network connection ---

[DEBUG] @slack/client:KeepAlive pong timer expired, recommend reconnect
[DEBUG] @slack/client:KeepAlive ping timer expired, sending ping
[DEBUG] @slack/client:RTMClient send() in state: connected,ready
[DEBUG] @slack/client:RTMClient sending message on websocket: {"type":"ping","id":49}
[ERROR] @slack/client:RTMClient failed to send message on websocket: WebSocket is not open: readyState 2 (CLOSING)
[ERROR] @slack/client:KeepAlive Unhandled error: Failed to send message on websocket: WebSocket is not open: readyState 2 (CLOSING). Please report to @slack/client package maintainers.

--- reconnect network connection ---

[DEBUG] @slack/client:RTMClient received message on websocket: {"type":"pong","reply_to":48}
[DEBUG] @slack/client:RTMClient transitioning to state: reconnecting
[DEBUG] @slack/client:RTMClient transitioning to state: connecting

immediately after the 'recommend reconnect', i'd expect to see the RTMClient transition to reconnecting, but instead it stays in the connected,ready state. continuing to investigate.


aoberoi commented Jun 6, 2018

currently, the RTMClient depends on the behavior that calling websocket.close() results in the websocket object emitting a 'close' event. however, this doesn't appear to happen here.

i'm not sure whether this behavior is correct to expect. i can't find much in the spec about this, but from my read of the following test code, i think it should work:

https://github.com/websockets/ws/blob/bf9b2ececbe42dd07ef9619d2b4953f57243c843/test/websocket.test.js#L241-L258

Update: actually, in the above test, it's the server that initiates the closing of the websocket. according to the protocol specification, clients "SHOULD NOT" close the connection (but still may): https://tools.ietf.org/html/rfc6455#section-7.3. however, according to the DOM API, the close event should be fired: https://html.spec.whatwg.org/multipage/web-sockets.html#closeWebSocket

Update 2: i think it is expected behavior: https://github.com/websockets/ws/blob/bf9b2ececbe42dd07ef9619d2b4953f57243c843/test/websocket.test.js#L1194-L1210. there's something to debug here.
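
as a sketch of what i'd expect to work (assuming a hypothetical local echo server; this isn't code from the PR):

```typescript
// minimal sketch of the expected behavior: a client-initiated close()
// should still fire the 'close' event once the close handshake completes.
// the URL points at a hypothetical local server.
import WebSocket from 'ws';

const ws = new WebSocket('ws://localhost:8080');
ws.on('open', () => {
  ws.close(1000); // client starts the close handshake
});
ws.on('close', (code: number) => {
  console.log(`close event fired with code ${code}`); // expect 1000
});
```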


aoberoi commented Jun 6, 2018

i think i figured it out! when you call websocket.close(), that initiates the close handshake, but the 'close' event can't fire until a close frame is actually sent on the wire. if the connection is truly down, that's not going to happen (at least not anytime soon). therefore, stopping the KeepAlive monitoring only after the 'close' event fires leaves plenty of time to send new pings that serve no purpose. the last commit in this PR preemptively stops the monitoring so that there aren't issues sending these futile pings. we've already committed to tearing down the websocket connection, so a keep-alive process doesn't make any sense.
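
roughly, the ordering looks like this (a sketch with assumed shapes, not the exact diff in this PR):

```typescript
import WebSocket from 'ws';

// sketch with assumed shapes: stop keep-alive before starting the close
// handshake, since 'close' may never fire (or fire very late) on a dead
// connection.
function teardownConnection(websocket: WebSocket, keepAlive: { stop(): void }): void {
  // we've already committed to tearing the connection down, so any further
  // pings are futile and would fail while readyState is CLOSING.
  keepAlive.stop();

  websocket.close();

  // defensive fallback (illustrative, not part of this PR): if the close
  // handshake can't complete because the network is down, force the socket
  // closed after a timeout.
  const fallback = setTimeout(() => websocket.terminate(), 4000);
  websocket.once('close', () => clearTimeout(fallback));
}
```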

aoberoi requested a review from shaydewael on June 6, 2018 07:03
Review thread on KeepAlive.start():

       */
      public start(client: RTMClient): void {
        this.logger.debug('start monitoring');
Contributor:

Should we make these a little less vague so they make more sense in the context of all the other debug logging an app will be doing?

aoberoi (Author):

i think this is okay since each log line is prepended with the object name (e.g. KeepAlive), and in that context, monitoring only really has one meaning.

aoberoi merged commit 3c8789a into slackapi:master on Jun 7, 2018
aoberoi deleted the keepalive-inconsistency-issues branch on June 11, 2018 16:43