Websocket state errors with production usage #550

bmajz · 2018-05-07T18:28:52Z

Description

Using this in production, we often get websocket state errors that lead to us having to take the connection down. Here are is an example (not the frames may not be in order due to how our logs are processed):

[ERROR] @slack/client:RTMClient A websocket error occurred: Unexpected server response: 503 
Error: Unhandled event 'websocket close' in state 'connecting'. 
    at /home/node_modules/finity/lib/core/HierarchicalStateMachine.js:79:33 
    at StateMachine.handleUnhandledEvent (/home/node_modules/finity/lib/core/StateMachine.js:76:13) 
    at TaskScheduler.execute (/home/node_modules/finity/lib/core/TaskScheduler.js:29:7) 
    at WebSocket.websocket.addEventListener.event (/home/node_modules/@slack/client/dist/RTMClient.js:391:77) 
    at HierarchicalStateMachine.handle (/home/node_modules/finity/lib/core/HierarchicalStateMachine.js:72:24) 
    at WebSocket.emit (events.js:191:7) 
    at TaskScheduler.enqueue (/home/node_modules/finity/lib/core/TaskScheduler.js:19:12) 
    at emitTwo (events.js:106:13) 
    at WebSocket.onClose (/home/node_modules/ws/lib/event-target.js:124:16) 
    at WebSocket.finalize (/home/node_modules/ws/lib/websocket.js:187:12)

We are also regularly seeing this error appear in logs:
[ERROR] @slack/client:KeepAlive Unhandled error: Failed to send message on websocket: WebSocket is not open: readyState 2 (CLOSING). Please report to @slack/client package maintainers.

What type of issue is this? (place an `x` in one of the `[ ]`)

Requirements (place an `x` in each of the `[ ]`)

I've read and understood the Contributing guidelines and have done my best effort to follow them.
I've read and agree to the Code of Conduct.
I've searched for any related issues and avoided creating a duplicate issue.

Bug Report

Filling out the following details about bugs will help us solve your issue sooner.

Reproducible in:

@slack/client version: 4.2.0

node version: v6.10.3

OS version(s): alpine linux 3.7

Steps to reproduce:

Haven't been able to figure out specific repro steps unfortunately. Just these observations from production logs

Expected result:

No error

Actual result:

Errors

Attachments:

N/A

The text was updated successfully, but these errors were encountered:

aoberoi · 2018-05-08T00:30:04Z

@bmajz thanke for reporting. this seems like a tricky one to reproduce, but the 503 was a really good hint. @shanedewael and i spoke about possible causes and we identified a few things that should be handled better. i'll let him weigh in on the changes.

shaydewael · 2018-05-08T16:25:56Z

@bmajz We saw the errors you received as two different issues (both of which should be addressed). For the unhandled websocket close event, it seems related to #537. In the state machine, it looks like the websocket close event isn't properly handled in the authenticated substate of connecting. It will (hopefully) be a pretty simple fix.

When we looked into the second error you saw in KeepAlive, it seemed as if the stop() function wasn't clearing state properly. This seems less critical than the state machine bug, but should also be a fairly quick fix.

bmajz · 2018-05-09T04:09:29Z

Sweet — let me know when we’re good to retest! For now we jumped back to v3

anfilat · 2018-05-12T07:25:22Z

it still doesn't work with 4.2.2. I wrote simplest rtm bot

const { RTMClient } = require('@slack/client');

const SLACK_BOT_TOKEN = 'xoxb-...';

const rtm = new RTMClient(SLACK_BOT_TOKEN, {logLevel: 'debug'});
rtm.start();

rtm.on('message', (event) => {
	console.log(event);
});

and it cannot connect to Slack servers. Start log -

anfilat@afcomp:~/slack/node-js-getting-started$ node ind.js 
[DEBUG] @slack/client:WebClient initialized
[DEBUG] @slack/client:RTMClient transitioning to state: disconnected
[DEBUG] @slack/client:RTMClient pausing outgoing event queue
[DEBUG] @slack/client:RTMClient initialized
[DEBUG] @slack/client:RTMClient start()
[DEBUG] @slack/client:RTMClient transitioning to state: connecting
[DEBUG] @slack/client:RTMClient transitioning to state: connecting:authenticating
[DEBUG] @slack/client:WebClient apiCall() start
[DEBUG] @slack/client:WebClient request attempt
[DEBUG] @slack/client:RTMClient transitioning to state: connecting:authenticated
[ERROR] @slack/client:RTMClient A websocket error occurred: connect ETIMEDOUT 52.59.24.122:443
[DEBUG] @slack/client:RTMClient transitioning to state: connecting:reconnecting
anfilat@afcomp:~/slack/node-js-getting-started$

There is strange thing in transitioning to state: connecting:reconnecting I think there must be transitioning to state: reconnecting

aoberoi · 2018-05-12T11:43:44Z

@anfilat the logging is definitely strange, good catch! did the application crash? did you terminate the app? i don't see an error in your output.

aoberoi · 2018-05-12T12:28:19Z

I think the above PR should actually fix this issue now.

anfilat · 2018-05-13T04:41:20Z

It's the full log. Application is self-terminated. There is something wrong with transitions from a child to a parent machine.
Yes, PR-560 is usefull fix

dperez3 · 2018-06-01T23:26:53Z

This issue might still exist in some form or another. I'm not sure if there's a better place to put this, but I ran into a similar issue with version 4.2.0.

I had multiple async uses of the following method... (in typescript)

    async awaitResponseAsync():Promise<string> {
        var promise = new Promise<string>(async (resolve, reject) => {
            let rtmClient = new RTMClient(BOT_TOKEN);

            let channelId = await this._communicationEntity.getChannelIdAsync();

            rtmClient.start(null);
            // wait for response
            rtmClient.once(`message`, (event) => {
                console.log(event);
                if(event.channel == channelId) {
                    this._responseText = event.text;
                    resolve(event.text);
                    rtmClient.disconnect();
                }
            });
            rtmClient.once(`error`, (err) => {
                console.error(event);
                reject(err);
                rtmClient.disconnect();
            });
        });

        return promise;
    }

It seemed as though the happenings in 1 instance of RTMClient affected all other instances resulting in the other RTMClient instances no longer invoking subscribers when 1 instance was done with. Changing it to use on subscriptions instead of once fixed it...

    async awaitResponseAsync():Promise<string> {
        var promise = new Promise<string>((resolve, reject) => {
            let rtmClient = new RTMClient(BOT_TOKEN);

            rtmClient.start(null);
            // wait for response
            rtmClient.on(`message`, async (event) => {

                let channelId = await this._communicationEntity.getChannelIdAsync();

                if(event.channel == channelId) {
                    console.log(`Response received for ${this.communicationEntity.name}`, event);
                    this._responseText = event.text;
                    resolve(event.text);
                    rtmClient.disconnect();
                }
            });
            rtmClient.on(`error`, (err) => {
                console.error(event);
                reject(err);
                rtmClient.disconnect();
            });
        });

        return promise;
    }

aoberoi · 2018-06-03T04:34:38Z

@dperez3 i'm not sure I understand the problem you're describing. the two code snippets you provided seem identical (except for the additional console.log).

could you help me understand what the awaitResponseAsync() function is supposed to do? it seems like you might be trying to wait for a specific message you already sent to be acknowledged by listening to the RTM stream and waiting for it. if that's the case, i can offer a few improvements for you.

if you sent the message using RTM, you should just await the Promise returned from RTMClient.sendMessage(). For example:
```
(async () => {
  const rtm = new RTMClient(BOT_TOKEN);
  const response = await rtm.sendMessage('a simple message', channelId);
  // any code here executes after the message is sent
}());
```
the problem with how you've written the code currently is that the "message" event will trigger for all incoming messages that the bot can see. it seems like you're looking for a response to a very specific message in a specific channel. changing a .once() to a .on() is just helping your handler run on more messages, because the first one it sees might not be in the right channel.
if you don't use the previous suggestion, it still looks like you can eliminate wrapping your code in a new Promise(). you're already in an async function, which will always return a Promise, even when you just return a string.
in the next version, you don't need to pass in null as an argument to rtm.start(). we had an issue where we accidentally typed that argument as required instead of optional, but this will be resolved in the next release (likely to occur on Monday).

dperez3 · 2018-06-04T17:17:53Z

@aoberoi thank you for the quick response! I apologize, I did paste the wrong code in the second portion and I have updated it. However, you did seem to understand what I was trying to do pretty well anyway. The on vs once point you made makes total sense now. Of course once would result in it not receiving the intended message if it already heard an ignored message first.

This did break again today and ill respond to your other points now...

Again, great point with on vs once. However, I'm not trying to await the http success of a message being sent to a channel. That happens elsewhere via WebClient. Here I'm trying to receive a response to the message from a user on a channel. Or is that what the code you provided would actually do? If so, that'd be great! If not, is there a way to just listen for messages on a specific channel? (I'm gonna try it anyway though).
I suppose this is true if const response = await rtm.sendMessage('a simple message', channelId); returns a user response. I'm not sure how I could do this if I have to filter over the many possible messages I could receive until I get one from a specific channel though. Might be my inexperience with EventEmitters though.
Great! I'll be installing the update asap.

Edit: The fundamental thing I'm trying to do is asynchronously ask channels (dm and otherwise) a question and wait for an answer to the question per channel.

dperez3 · 2018-06-04T18:02:57Z

Update

I've updated to ^4.2.2 and modified my code for a bit of an experiment...

    async awaitResponseAsync():Promise<string> {
        return await new Promise<string>(async (resolve, reject) => {
            let rtmClient = new RTMClient(BOT_TOKEN);

            rtmClient.start(null);

            try {
                for (var i = 0; i < 30; i++) {
                    let rtmClient = new RTMClient(BOT_TOKEN);
                    rtmClient.start(null);
                    rtmClient.disconnect();
                }
            } catch (err) {
                console.error('Error', err);
            }

            // wait for response
            rtmClient.on(`message`, async (event) => {

                let channelId = await this.communicationEntity.getChannelIdAsync();

                if(event.channel == channelId) {
                    console.log(`Response received for ${this.communicationEntity.name}`, event);
                    this._responseText = event.text;
                    resolve(event.text);
                    rtmClient.disconnect();
                }
            });
            rtmClient.on(`error`, (err) => {
                console.error(event);
                reject(err);
                rtmClient.disconnect();
            });
        });
    }

The bit surrounded by a try catch generates this error...

Error Error: Unhandled event 'explicit disconnect' in state 'connecting'.
    at StateMachine.handleUnhandledEvent (~\node_modules\finity\lib\core\StateMachine.js:76:13)
    at ~\node_modules\finity\lib\core\HierarchicalStateMachine.js:79:33
    at TaskScheduler.execute (~\node_modules\finity\lib\core\TaskScheduler.js:29:7)
    at TaskScheduler.enqueue (~\node_modules\finity\lib\core\TaskScheduler.js:19:12)
    at HierarchicalStateMachine.handle (~\node_modules\finity\lib\core\HierarchicalStateMachine.js:72:24)
    at RTMClient.disconnect (~\node_modules\@slack\client\dist\RTMClient.js:284:27)
    at Promise (~\src\standup\meeting\response.js:27:31)
    at new Promise (<anonymous>)
    at Response.awaitResponseAsync (~\--\response.js:20:22)
    at Question.askAsync (~\src\--\question.js:32:49)

dperez3 · 2018-06-04T18:43:49Z

One issue I have found is that I am not waiting for RTMClient to be connected before listening. Another issue is... since I am spawning off multiple instances, instances start taking almost a minute to connect. A disconnect during this time would yield an error like the above. However, a disconnect would only happen if a message was received or an error occurred. If a message is being received before it actually connected that would lead me to believe that multiple instances get confused over some shared resource.

Anyway, I'm going to clean up some things like refactor for as little instances as possible since they can take so long to connect.

aoberoi · 2018-06-04T22:13:30Z

@dperez3 okay great, thanks for clarifying! now that i understand a little more about what you're trying to accomplish, i have some further suggestions.

Here I'm trying to receive a response to the message from a user on a channel. Or is that what the code you provided would actually do? If so, that'd be great! If not, is there a way to just listen for messages on a specific channel? (I'm gonna try it anyway though).

No, that's not how the code I provided works; the promise resolves when the message being sent is acknowledged by Slack. Your problem is different.

How did you send the message you are awaiting on for a response? If you sent the first message with an RTMClient, you shouldn't need to instantiate a whole new RTMClient for listening for the user's response message. You can just listen on that same instance used to send the message. However, if instead you sent the message with a WebClient or IncomingWebhook, then there's two more options I suggest: 1) use the Events API to receive incoming messages (see @slack/events-api module) or 2) instantiate the RTMClient object and connect before you send the message (then there's no chance that you'd miss the message). In all of these cases, you want match messages generally (using .on() instead of .once()) and use a condition to check if its a message you need (e.g. check if its from the right channel).

i don't understand the experiment above. if you start an RTM connection and then disconnect before it finishes, then its a useless RTMClient (but I do think we should do better and not throw an error - i'll create a separate issue for that). why would you want many connections with the same token? if its because you are waiting for separate responses concurrently, then i think it would be better to store information about the separate responses you are waiting for, and then use the message handler to check against that set. without knowing too much more about your app, here's an example that might be useful:

const rtm = new RTMClient(BOT_TOKEN);
rtm.start(null);

// This object maps channel ID keys to values that are shaped as { promise, resolveFn, rejectFn }
const channelsAwaitingUserResponse = new Map();

rtm.on('message', (event) => {
  if (channelsAwaitingUserResponse.has(event.channel)) {
    const val = channelsAwaitingUserResponse.get(event.channel);
    val.resolveFn(event);
  }
});

async function sendMessageAndAwaitResponse(message, channelId) {
  await rtm.sendMessage(message, channelId);
  const userResponded = new Promise((resolve, reject) => {
    channelsAwaitingUserResponse[channelId] = {
      promise: userResponded,
      resolveFn: resolve,
      rejectFn: reject,
    };
  });
  return await userResponded;
};


// Now you can send a message and await for the user's response
const userResponse = await sendMessageAndAwaitResponse('hello world', 'C12345');

You'll notice that in this example, the RTMClient is only instantiated once, and stays connected for as long as the application is running. It also only has one message listener, which essentially "routes" the incoming message with an appropriate action. In more complex apps you might have many independent message handlers.

Note that the example above is limited to allowing only one "unresponded" message per channel. I'm not sure if this is a realistic constraint for your app, but it seems like you were already making that assumption in your code examples, so it carried over into my example. You could remove that constraint by creating a different condition to identify whether a specific message is a "response" to the message you send, besides just checking the channel ID.

bmajz · 2018-06-05T06:16:16Z

Just to return to the original topic at hand. I just got the following KeepAlive errors on 4.2.2

Jun 04 22:14:01 d3ace470e8cc ecs-rtmworker-19-rtmworker-f482e18cdabd8ed69f01: 05:14:01 1|rtm      | [ERROR] @slack/client:RTMClient failed to send message on websocket: WebSocket is not open: readyState 2 (CLOSING)
Jun 04 22:14:01 d3ace470e8cc ecs-rtmworker-19-rtmworker-f482e18cdabd8ed69f01: 05:14:01 1|rtm      | [ERROR] @slack/client:KeepAlive Unhandled error: Failed to send message on websocket: WebSocket is not open: readyState 2 (CLOSING). Please report to @slack/client package maintainers.

aoberoi added the bug M-T: A confirmed bug report. Issues are confirmed when the reproduction steps are documented label May 11, 2018

aoberoi mentioned this issue May 11, 2018

Adds websocket close handlers in connecting:authenticated #555

Merged

2 tasks

aoberoi closed this as completed in #555 May 11, 2018

aoberoi reopened this May 12, 2018

aoberoi added the needs info An issue that is claimed to be a bug and hasn't been reproduced, or otherwise needs more info label May 12, 2018

aoberoi mentioned this issue May 12, 2018

handles transitions at parent machine instead of submachine #560

Merged

2 tasks

aoberoi closed this as completed in #560 May 15, 2018

aoberoi removed the needs info An issue that is claimed to be a bug and hasn't been reproduced, or otherwise needs more info label Jun 3, 2018

aoberoi mentioned this issue Jun 5, 2018

RTMClient handle transition for explicit disconnect from connecting #576

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Websocket state errors with production usage #550

Websocket state errors with production usage #550

bmajz commented May 7, 2018

aoberoi commented May 8, 2018

shaydewael commented May 8, 2018

bmajz commented May 9, 2018

anfilat commented May 12, 2018

aoberoi commented May 12, 2018

aoberoi commented May 12, 2018

anfilat commented May 13, 2018

dperez3 commented Jun 1, 2018 •

edited

Loading

aoberoi commented Jun 3, 2018 •

edited

Loading

dperez3 commented Jun 4, 2018 •

edited

Loading

dperez3 commented Jun 4, 2018 •

edited

Loading

dperez3 commented Jun 4, 2018 •

edited

Loading

aoberoi commented Jun 4, 2018

bmajz commented Jun 5, 2018

Websocket state errors with production usage #550

Websocket state errors with production usage #550

Comments

bmajz commented May 7, 2018

Description

What type of issue is this? (place an x in one of the [ ])

Requirements (place an x in each of the [ ])

Bug Report

Reproducible in:

Steps to reproduce:

Expected result:

Actual result:

Attachments:

aoberoi commented May 8, 2018

shaydewael commented May 8, 2018

bmajz commented May 9, 2018

anfilat commented May 12, 2018

aoberoi commented May 12, 2018

aoberoi commented May 12, 2018

anfilat commented May 13, 2018

dperez3 commented Jun 1, 2018 • edited Loading

aoberoi commented Jun 3, 2018 • edited Loading

dperez3 commented Jun 4, 2018 • edited Loading

dperez3 commented Jun 4, 2018 • edited Loading

Update

dperez3 commented Jun 4, 2018 • edited Loading

aoberoi commented Jun 4, 2018

bmajz commented Jun 5, 2018

What type of issue is this? (place an `x` in one of the `[ ]`)

Requirements (place an `x` in each of the `[ ]`)

dperez3 commented Jun 1, 2018 •

edited

Loading

aoberoi commented Jun 3, 2018 •

edited

Loading

dperez3 commented Jun 4, 2018 •

edited

Loading

dperez3 commented Jun 4, 2018 •

edited

Loading

dperez3 commented Jun 4, 2018 •

edited

Loading