Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WebSocket error occurred: Unexpected server response: 409 #2094

Open
gdhardy1 opened this issue Nov 3, 2024 · 21 comments · May be fixed by #2099
Open

WebSocket error occurred: Unexpected server response: 409 #2094

gdhardy1 opened this issue Nov 3, 2024 · 21 comments · May be fixed by #2099
Labels
backend Issue related to the Slack API bug M-T: A confirmed bug report. Issues are confirmed when the reproduction steps are documented needs info An issue that is claimed to be a bug and hasn't been reproduced, or otherwise needs more info pkg:socket-mode applies to `@slack/socket-mode`

Comments

@gdhardy1
Copy link

gdhardy1 commented Nov 3, 2024

Socket mode receiver fails to make a successful connection due to WebSocket error occurred: Unexpected server response: 409 error. Retries after this error occurs may eventually lead to a successful connection or worse, rate limiting from Slack servers on the apps.connections.open call.

@slack/bolt version

4.0.1

Your App and Receiver Configuration

Socket mode 2.0.2

Node.js runtime version

v20.17.0

Steps to reproduce:

This error seems to reproduce only when launching in a distributed environment (i.e. launching 2 or more instances of the app).
Nothing special is being done on start up. It is minimally reproducible using the example from BoltJS Getting Started docs as a basis.

const app = new App({
  socketMode: true,
  token: process.env.BOT_TOKEN,
  signingSecret: process.env.SIGNING_SECRET,
  appToken: process.env.APP_TOKEN,
  logger
});

void (async () => {
  // Start your app
  await app.start(Number(process.env.PORT) || 3000);

  console.log('⚡️ Bolt app is running!');
})();
  1. Open multiple shell terminals (suggested 5) and change process.env.PORT to a different number in each shell
  2. Quickly start the bolt app in each shell (roughly simultaneously to mimic deployment as a server group)

Expected result:

Each instance of the app should start up successfully without the WebSocket error occurred: Unexpected server response: 409 error

Actual result:

First we see the Socket Mode client successfully initialized.

[14:05:55.636] DEBUG (36123): Initializing SocketModeReceiver
[14:05:55.636] DEBUG (36123): The logLevel given to Socket Mode was ignored as you also gave logger
[14:05:55.637] DEBUG (36123): The logLevel given to WebClient was ignored as you also gave logger
[14:05:55.637] DEBUG (36123): initialized
[14:05:55.637] DEBUG (36123): The Socket Mode client has successfully initialized
[14:05:55.637] DEBUG (36123): apiCall('auth.test') start
[14:05:55.638] DEBUG (36123): http request url: https://slack.com/api/auth.test
[14:05:55.638] DEBUG (36123): http request body: {"token":"[[REDACTED]]"}
[14:05:55.638] DEBUG (36123): http request headers: {"Accept":"application/json, text/plain, */*","User-Agent":"@slack:socket-mode/2.0.2 @slack:bolt/4.0.1 @slack:web-api/7.7.0 node/20.17.0 linux/6.8.0-1017-aws","Authorization":"[[REDACTED]]"}

Then we see the 409 error and the following logs repeat over and over again until:

  • the connection is established OR
  • the app is rate limited
[14:05:55.639] DEBUG (36123): Starting Socket Mode session ...
[14:05:55.639] DEBUG (36123): Going to retrieve a new WSS URL ...
[14:05:55.639] DEBUG (36123): apiCall('apps.connections.open') start
[14:05:55.639] DEBUG (36123): http request url: https://slack.com/api/apps.connections.open
[14:06:46.734] DEBUG (36239): http request body: {}
[14:06:46.734] DEBUG (36239): http request headers: {"Accept":"application/json, text/plain, */*","User-Agent":"@slack:socket-mode/2.0.2 @slack:bolt/4.0.1 @slack:web-api/7.7.0 node/20.17.0 linux/6.8.0-1017-aws","Authorization":"[[REDACTED]]"}
[14:06:46.755] DEBUG (36239): http response received
[14:06:46.755] DEBUG (36239): http request result: {"ok":true,"url":"wss://wss-primary.slack.com/link/?ticket=084fad38-f4fe-4fba-9057-fedc6792738f&app_id=4cd5f1068f00f78fe83741337c42c958d96b3b3b13a7e17b4cda776e289e362c","response_metadata":{"scopes":["connections:write"],"acceptedScopes":["connections:write"]}}
[14:06:46.755] DEBUG (36239): apiCall('apps.connections.open') end
[14:06:46.755] DEBUG (36239): Initiating new WebSocket connection.
[14:06:46.768] ERROR (36239): WebSocket error occurred: Unexpected server response: 409
[14:06:46.768] DEBUG (36239): WebSocket already disconnected, flushing remainder.
[14:06:46.768] DEBUG (36239): Before trying to reconnect, this client will wait for 5000 milliseconds
[14:06:46.768] ERROR (36239): WebSocket error! Error: Failed to send message on websocket: Unexpected server response: 409
[14:06:46.768] DEBUG (36239): WebSocket close frame received (code: 1006, reason: )
[14:06:46.768] DEBUG (36239): WebSocket already disconnected, flushing remainder.
[14:06:46.768] DEBUG (36239): Before trying to reconnect, this client will wait for 10000 milliseconds
[14:06:46.779] DEBUG (36239): Continuing with reconnect...

Additional Notes

It seems that a successfully established connection does not mean the app is in a "safe" state. While running, I've observed both Primary WebSocket error occurred: Unexpected server response: 409 and Secondary WebSocket error occurred: Unexpected server response: 409 errors as well. The behavior is basically identical to as described above.

I've also observed this on @slack/bolt-js versions as early as 3.20.0.

@gdhardy1
Copy link
Author

gdhardy1 commented Nov 3, 2024

Thanks in advance for investigating. I'm happy to provide any additional information. I'd appreciate any temporary workaround ideas in the short term. Being unexpectedly rate limited during business hours would be a huge disruption as the app where I experienced this is in the critical path for users seeking engineering support from my organization.

@gdhardy1
Copy link
Author

gdhardy1 commented Nov 5, 2024

I found some time to make a reproduction experiment for what I observed.

The issue seems to lie within in @slack/socket-mode.

  • When the SlackWebSocket encounters this 409 error, it calls to disconnect then it calls to terminate
  • This in turn emits a close event to the SocketModeClient and the client is started again

I see that the SocketModeClient is internal to SocketModeReceiver,autoReconnectEnabled is true by default, and this option is not configureable via AppOptions or SocketModeReceiverOptions. Additionally, isRecoverable is true for rate limited errors.

Given this, it seems that once a rate limit error is encountered, there isn't a way to recover and the call to apps.connections.open will be retried indefinitely.

@filmaj filmaj added question M-T: User needs support to use the project and removed untriaged labels Nov 5, 2024
@filmaj
Copy link
Contributor

filmaj commented Nov 5, 2024

Hey @gdhardy1 , thank you so much for the excellent reproduction materials 🙏 It makes it so much easier to try to help when the problem is clearly demonstrated ❤️

I acknowledge that it is frustrating to see this affecting your business app. Let's see if we can figure out a workaround or how we should go about getting this fixed.

A few things that stand out to me immediately:

  • The Slack tools, be it @slack/socket-mode or @slack/web-api, should be honouring any Retry-After headers in terms of reconnection attempts or rate limiting - are you not seeing that happen? Do you happen to have any debug logs for the 429 response?
  • If you did have access to any of the options you linked to, be it for socket-mode or bolt AppOptions or SocketModeReceiverOptions, would that help you in the short-term? If so, that's an easy fix/feature we can release pretty quickly to, at least, stem the bleeding.

I wonder what the proper behaviour here is. It feels to me like it would be reasonable to have the socket connection attempt to be re-established in a (presumably?) temporary issue like an HTTP 409.

Do you have any other thoughts on what expected behaviour should be in the unfortunate circumstance that socket endpoint is sending 409s?

@filmaj filmaj added bug M-T: A confirmed bug report. Issues are confirmed when the reproduction steps are documented needs info An issue that is claimed to be a bug and hasn't been reproduced, or otherwise needs more info and removed question M-T: User needs support to use the project labels Nov 5, 2024
@gdhardy1
Copy link
Author

gdhardy1 commented Nov 6, 2024

The Slack tools, be it @slack/socket-mode or @slack/web-api, should be honouring any Retry-After headers in terms of reconnection attempts or rate limiting - are you not seeing that happen?

Not exactly. I believe there are two issues that lead to multiple retries being executed simultaneously.

Issue 1

When the socket connection fails with the 409 error, both the error and close events are emitted from the WebSocket

This leads to SocketModeClient calling delayReconnectAttempt twice for a single failure

  • the first at this.clientPingTimeoutMS * this.numOfConsecutiveReconnectionFailures and
  • the second at this.clientPingTimeoutMS * this.numOfConsecutiveReconnectionFailures + 1.

This happens because both the error and close event handlers call SocketModeClient.disconnect

To avoid this, SocketModeClient should check if a retry is pending before calling delayReconnectAttempt.

Issue 2

numOfConsecutiveReconnectionFailures is reset when apps.connections.open succeeds.

This can lead to retries being scheduled earlier than they should.

Consider the scenario where:

  • the socket connection fails and two retries are scheduled as described in Issue 1 then
  • the first retry starts but
    • the apps.connections.open succeeds and
      • numOfConsecutiveReconnectionFailures set to 0`
    • the socket connection fails
      • two more retries scheduled

The retry timing will be:

Time (ms) Event
0 apps.connections.open succeeds; socket connect fails
5000 Retry 1 starts; apps.connections.open succeeds; socket connection fails
10000 Retry 2 starts AND Retry 3 starts
15000 Retry 4 starts

You can replicate this by running the experiment with MAX_CALLS=4.

I suspect that either of following would be preferable:

However, that might not be necessary as long as Issue 1 is addressed.

Answers to other follow up questions.

Do you happen to have any debug logs for the 429 response?

I wasn't able to capture debug level logs when I encountered this in production. But the reproduction experiment is pretty close if not identical I think.

If you did have access to any of the options you linked to, be it for socket-mode or bolt AppOptions or SocketModeReceiverOptions, would that help you in the short-term?

Actually, after trying, I don't think so. I thought setting autoReconnectEnabled to false in conjunction with rejectRateLimitedCalls on the WebClient might help me catch and implement my own retry logic but that seemed like more effort than just fixing the root problem

@filmaj filmaj transferred this issue from slackapi/bolt-js Nov 6, 2024
@filmaj
Copy link
Contributor

filmaj commented Nov 6, 2024

Thanks for the analysis. I've moved this issue over to the node-slack-sdk (the home of the socket-mode client); will sift through everything you've posted and see if I can capture the issue as a test to help frame the discussion.

Separate issue: I think I will riff on your repro a bit to add Retry-After header responses to the apps.connections.open endpoint to see if retry handling works as expected. Will see what kind of responses that API raises in a rate limit scenario.

@bitofsky
Copy link

bitofsky commented Nov 10, 2024

image

Hello, I recently encountered the same issue.

When running two instances of a Bolt app, one of the apps would unexpectedly disconnect from the socket connection. During the reconnection process, it would fail repeatedly, doubling the reconnection attempts each time. This cycle continued until it led to an Out of Memory error, eventually causing the app to crash.

[ERROR]  socket-mode:SocketModeClient:0 WebSocket error! Error: Failed to send message on websocket: Unexpected server response: 409
[ERROR]  socket-mode:SlackWebSocket:1843 WebSocket error occurred: Unexpected server response: 409
[WARN]  bolt-app http request failed A rate limit was exceeded (url: apps.connections.open, retry-after: 2)

@filmaj
Copy link
Contributor

filmaj commented Nov 12, 2024

Update here: I have escalated this issue to the real-time server backend team to raise the issue related to unexpected 409 responses from that server.

I am still working on addressing the issue of what this socket-mode library should do in this situation and how to prevent it from getting into a rate-limit situation with the apps.connections.open API.

@filmaj filmaj added pkg:socket-mode applies to `@slack/socket-mode` backend Issue related to the Slack API labels Nov 12, 2024
@matthewhenry1
Copy link

@bitofsky upgrading to @slack/bolt 4.1.0 should resolve the Out of Memory error you're experiencing. I experienced the same Out of Memory error running 3.21.4

@bitofsky
Copy link

@matthewhenry1 Thanks for the response. I was running on @slack/bolt 4.0.1. I'll upgrade to 4.1.0 and test if the issue persists.

@filmaj
Copy link
Contributor

filmaj commented Nov 13, 2024

The underlying issue still exists in bolt 4.x, the only thing upgrading to 4.x may help is if you experience crashes with a class of errors in the form "unhandled event X in state Y", e.g. "Unhandled event 'server explicit disconnect' in state 'connecting'." (as described in this comment).

@alisd23
Copy link

alisd23 commented Nov 15, 2024

Just to add some more info: I'm also experiencing this issue in a distributed setup, and I'm also sharing app tokens across environments (local, staging, live) which is maybe a bad idea. I'm consistently getting 409 errors. Looking at the logs it seemingly starts with a random 409 response, then ends up in an out of control (seemingly exponential) chain reaction of reconnections - as others have seen. I might have to remove the integration as it's spamming our logs (and CPU resource) so much.

image

Given it's probably a distributed problem - could it be some connection limit issue? The original example of this issue uses 5 instances, which is less than the max 10 connection limit shown here: https://api.slack.com/apis/socket-mode#connections. Could this limit be the cause of the 409 errors, and could this limit be wrong in the docs - i.e. lower than it mentions? I'm fairly sure I don't have 10 instances in total at any time though.

Related question: Does anyone know if this limit is per app token or per app?

I'm using latest version 4.1.0.

@bitofsky
Copy link

bitofsky commented Nov 16, 2024

@matthewhenry1 Thanks for the response. I was running on @slack/bolt 4.0.1. I'll upgrade to 4.1.0 and test if the issue persists.

Issues are still occurring in version 4.1.0.

@filmaj
Copy link
Contributor

filmaj commented Nov 18, 2024

This does seem to be an infrastructural issue; from what I understand, the 409 response should be an internal-only signal for the real-time backend system to find another container to serve the connection request. For whatever reason (a recent change or bug), it is bubbling up to the public / to the apps. The infrastructure team is still investigating the cause.

@dschneck-ibm
Copy link

dschneck-ibm commented Nov 21, 2024

IBM is hitting this issue attempting to upgrade a major Slack application. Our only workaround was to decrease the number of instances of the application to 1. That is an unacceptable workaround. The reason we are upgrading is because of Unhandled event 'server explicit disconnect causing frequent, unexpected restarts of all the app instances. How can we proceed if web socket 1.x is a problem and web socket 2.x is a problem? We need fixes or workarounds.

@gdhardy1
Copy link
Author

We just experienced another outage today. Luckily, we had alerting set up so that we could mitigate within roughly 10-15 minutes. But honestly, even that amount of downtime is tremendously painful for our users.

In case it's helpful with isolating a change that introduced the bug, we started observing the 409 errors on Oct 28th or 29th. Before that time, these errors virtually never occurred.

What makes this especially painful is that recovery pretty much requires human intervention and downtime is guaranteed:

  • The first 409 error usually results in the same errors on other nodes
    • this leads to unresponsiveness first, then eventually rate limiting
  • When launching replacement nodes, they often come up in a bad state
    • Manually restarting "bad nodes" individually, yields the same result
  • We had to coordinate killing/restarting all nodes simultaneously to recover successfully from the rate limiting

@yichennnn36
Copy link

I also encountered the same issue. Because I have two instances of a Bolt app.
How could I resolve this problem? or how could I detect error and disconnected the Bolt app

[ERROR]  socket-mode:SocketModeClient:0 WebSocket error! Error: Failed to send message on websocket: Unexpected server response: 409
[ERROR]  socket-mode:SlackWebSocket:1843 WebSocket error occurred: Unexpected server response: 409
[WARN]  bolt-app http request failed A rate limit was exceeded (url: apps.connections.open, retry-after: 2)

@filmaj
Copy link
Contributor

filmaj commented Nov 22, 2024

Update: the backend team is still investigating the 409 issue. There has been some progress but it still being worked on.

I have made some progress on mitigating the double-reconnect logic identified by @gdhardy1 ("Issue 1" in their comment here). Will share a draft PR soon.

Still a few other things need to be untangled to make more meaningful progress: how should the socket library react to a 409 generally speaking? Should reconnects be disabled altogether, even if the consumer specifically turned reconnections on? Additionally, proper handling of WS endpoint reconnection rate limiting needs to be explored.

@filmaj filmaj linked a pull request Nov 22, 2024 that will close this issue
@filmaj
Copy link
Contributor

filmaj commented Nov 22, 2024

I have a draft PR that at least depicts the issue in an integration test in this repo: #2099

@filmaj
Copy link
Contributor

filmaj commented Nov 25, 2024

Another update: on Friday afternoon EST, a backend configuration problem was identified and we started partially rolling out a fix. Since then, we have seen metrics indicating that 409 errors are properly being consumed internally which should prevent them from being raised to applications. This fix will be rolled out fully today (Monday) and tomorrow, as so far it is only being partially rolled out to monitor impact on applications.

We expect the occurrence of 409s should have reduced since Friday and for them to be gone by Tuesday.

@gdhardy1
Copy link
Author

Thanks. Confirming that while yesterday I saw ~45,000, today I only two early in the AM.

@filmaj
Copy link
Contributor

filmaj commented Nov 26, 2024

Update: the backend fix has been rolled out to 75% and we expect to crank it up to 100% if not today then this week.

Also: as I mentioned, a PR is up with a proposed fix here: #2099

This fix is available as a pre-release as @slack/[email protected] under the npm dist tag 2.0.3rc (see socket-mode versions here: https://www.npmjs.com/package/@slack/socket-mode?activeTab=versions). Additionally, I have pre-released this pre-release in a bolt-js pre-release (yo dawg) as @slack/[email protected] under the npm dist tag 4.1.2rc (see bolt versions here: https://www.npmjs.com/package/@slack/bolt?activeTab=versions). For bolt-js devs affected by this bug, we would love if you could take either of these pre-releases for a spin and let us know how it works for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend Issue related to the Slack API bug M-T: A confirmed bug report. Issues are confirmed when the reproduction steps are documented needs info An issue that is claimed to be a bug and hasn't been reproduced, or otherwise needs more info pkg:socket-mode applies to `@slack/socket-mode`
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants