WebSocket error occurred: Unexpected server response: 409 #2094
Comments
Thanks in advance for investigating. I'm happy to provide any additional information. I'd appreciate any temporary workaround ideas in the short term. Being unexpectedly rate limited during business hours would be a huge disruption as the app where I experienced this is in the critical path for users seeking engineering support from my organization.
I found some time to make a reproduction experiment for what I observed. The issue seems to lie within in
I see that the Given this, it seems that once a rate limit error is encountered, there isn't a way to recover and the call to |
Hey @gdhardy1, thank you so much for the excellent reproduction materials 🙏 It makes it so much easier to try to help when the problem is clearly demonstrated ❤️ I acknowledge that it is frustrating to see this affecting your business app. Let's see if we can figure out a workaround or how we should go about getting this fixed. A few things that stand out to me immediately:
I wonder what the proper behaviour here is. It feels to me like it would be reasonable to attempt to re-establish the socket connection after a (presumably?) temporary issue like an HTTP 409. Do you have any other thoughts on what the expected behaviour should be in the unfortunate circumstance that the socket endpoint is sending 409s?
Not exactly. I believe there are two issues that lead to multiple retries being executed simultaneously.
Issue 1
When the socket connection fails with the This leads to
This happens because both the To avoid this,
Issue 2
numOfConsecutiveReconnectionFailures is reset when This can lead to retries being scheduled earlier than they should. Consider the scenario where:
The retry timing will be:
You can replicate this by running the experiment with I suspect that either of the following would be preferable:
However, that might not be necessary as long as Issue 1 is addressed.
Answers to other follow-up questions:
I wasn't able to capture debug level logs when I encountered this in production. But the reproduction experiment is pretty close, if not identical, I think.
Actually, after trying, I don't think so. I thought setting
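For readers following along, Issue 1 above (two handlers each scheduling a reconnect for the same failure) can be guarded against in general with a "single-flight" wrapper, so that overlapping reconnect requests share one attempt. The TypeScript below is only a sketch of that generic pattern. `SingleFlightReconnect` and the injected `connect` function are hypothetical names, not part of @slack/socket-mode, and this is not necessarily how the maintainers addressed it.

```typescript
// Hypothetical helper: NOT part of @slack/socket-mode or @slack/bolt.
// `connect` stands in for whatever function opens the WebSocket.
type Connect = () => Promise<void>;

class SingleFlightReconnect {
  private inFlight: Promise<void> | null = null;
  private readonly connect: Connect;

  constructor(connect: Connect) {
    this.connect = connect;
  }

  // If an attempt is already pending, return it instead of starting another.
  request(): Promise<void> {
    if (this.inFlight !== null) {
      return this.inFlight;
    }
    const clear = (): void => {
      // Only clear the slot if it still holds this attempt.
      if (this.inFlight === attempt) {
        this.inFlight = null;
      }
    };
    const attempt: Promise<void> = this.connect().then(
      () => {
        clear();
      },
      (err: unknown) => {
        clear();
        throw err; // propagate the failure to every caller sharing the attempt
      },
    );
    this.inFlight = attempt;
    return attempt;
  }
}
```

With this shape, an error handler and a close handler that both call `request()` for the same failure end up awaiting the same promise, so only one connection attempt is made per failure.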
Thanks for the analysis. I've moved this issue over to the node-slack-sdk (the home of the Separate issue: I think I will riff on your repro a bit to add |
Hello, I recently encountered the same issue. When running two instances of a Bolt app, one of the apps would unexpectedly disconnect from the socket connection. During the reconnection process, it would fail repeatedly, doubling the reconnection attempts each time. This cycle continued until it led to an Out of Memory error, eventually causing the app to crash.
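To illustrate why the doubling described here runs away so quickly: if every failed attempt schedules two new attempts, the number of concurrent attempts after n failed rounds is 2^n. The TypeScript sketch below only does that arithmetic. `attemptsAfterRounds` and the fan-out of 2 are illustrative assumptions, not measurements taken from the SDK.

```typescript
// Hypothetical illustration: counts concurrent attempts if each failed
// attempt spawns `fanOut` new ones (assumed to be 2, per "doubling").
function attemptsAfterRounds(rounds: number, fanOut: number = 2): number {
  let attempts = 1;
  for (let i = 0; i < rounds; i += 1) {
    attempts *= fanOut;
  }
  return attempts;
}

console.log(attemptsAfterRounds(10)); // 1024
console.log(attemptsAfterRounds(20)); // 1048576
```

Twenty consecutive failed rounds would already mean over a million pending attempts, which is consistent with the memory growth reported above.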
Update here: I have escalated this issue to the real-time server backend team to raise the issue related to unexpected 409 responses from that server. I am still working on addressing the issue of what this socket-mode library should do in this situation and how to prevent it from getting into a rate-limit situation with the |
@bitofsky upgrading to @slack/bolt |
@matthewhenry1 Thanks for the response. I was running on @slack/bolt |
The underlying issue still exists in bolt 4.x. Upgrading to 4.x may only help if you experience crashes with a class of errors of the form "unhandled event X in state Y", e.g. "Unhandled event 'server explicit disconnect' in state 'connecting'." (as described in this comment).
Just to add some more info: I'm also experiencing this issue in a distributed setup, and I'm also sharing app tokens across environments (local, staging, live) which is maybe a bad idea. I'm consistently getting 409 errors. Looking at the logs it seemingly starts with a random Given it's probably a distributed problem - could it be some connection limit issue? The original example of this issue uses 5 instances, which is less than the max 10 connection limit shown here: https://api.slack.com/apis/socket-mode#connections. Could this limit be the cause of the Related question: Does anyone know if this limit is per app token or per app? I'm using latest version |
Issues are still occurring in version 4.1.0.
This does seem to be an infrastructural issue; from what I understand, the 409 response should be an internal-only signal for the real-time backend system to find another container to serve the connection request. For whatever reason (a recent change or bug), it is bubbling up to the public / to the apps. The infrastructure team is still investigating the cause.
IBM is hitting this issue while attempting to upgrade a major Slack application. Our only workaround was to decrease the number of instances of the application to 1. That is an unacceptable workaround. The reason we are upgrading is that "Unhandled event 'server explicit disconnect'" errors were causing frequent, unexpected restarts of all the app instances. How can we proceed if web socket 1.x is a problem and web socket 2.x is a problem? We need fixes or workarounds.
We just experienced another outage today. Luckily, we had alerting set up so that we could mitigate within roughly 10-15 minutes. But honestly, even that amount of downtime is tremendously painful for our users. In case it's helpful with isolating a change that introduced the bug, we started observing the 409 errors on Oct 28th or 29th. Before that time, these errors virtually never occurred. What makes this especially painful is that recovery pretty much requires human intervention and downtime is guaranteed:
I also encountered the same issue, because I have two instances of a Bolt app.
Update: the backend team is still investigating the 409 issue. There has been some progress but it is still being worked on. I have made some progress on mitigating the double-reconnect logic identified by @gdhardy1 ("Issue 1" in their comment here). Will share a draft PR soon. A few other things still need to be untangled to make more meaningful progress: how should the socket library react to a 409, generally speaking? Should reconnects be disabled altogether, even if the consumer specifically turned reconnections on? Additionally, proper handling of WS endpoint reconnection rate limiting needs to be explored.
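On the rate-limiting question, one common general-purpose pattern is capped exponential backoff in which the failure counter is reset only after a connection has stayed up for some time, so a connection that flaps immediately does not restart the schedule from zero. The TypeScript below is a sketch of that pattern only. `ReconnectBackoff`, its option names and the thresholds are all hypothetical, and it is not a description of what the socket-mode library does or will do.

```typescript
// Hypothetical sketch: none of these names exist in @slack/socket-mode.
interface BackoffOptions {
  baseMs: number; // delay before the first retry
  maxMs: number; // upper bound on any delay
  stableAfterMs: number; // uptime required before the counter is reset
}

class ReconnectBackoff {
  private failures = 0;
  private connectedAt: number | null = null;
  private readonly opts: BackoffOptions;

  constructor(opts: BackoffOptions) {
    this.opts = opts;
  }

  // Delay before the next attempt: baseMs * 2^failures, capped at maxMs.
  nextDelayMs(): number {
    const raw = this.opts.baseMs * Math.pow(2, this.failures);
    return Math.min(this.opts.maxMs, raw);
  }

  // Record when a connection was established (timestamp in ms).
  onConnected(nowMs: number): void {
    this.connectedAt = nowMs;
  }

  // Called on a failed attempt or a dropped connection.
  // The counter resets only if the connection had been up long enough.
  onDisconnected(nowMs: number): void {
    const wasStable =
      this.connectedAt !== null &&
      nowMs - this.connectedAt >= this.opts.stableAfterMs;
    this.failures = wasStable ? 0 : this.failures + 1;
    this.connectedAt = null;
  }
}
```

Timestamps are passed in rather than read from the clock so the policy can be exercised in a test without timers. Adding random jitter to the delay would be a natural extension when several instances share one app token.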
I have a draft PR that at least depicts the issue in an integration test in this repo: #2099
Another update: on Friday afternoon EST, a backend configuration problem was identified and we started partially rolling out a fix. Since then, we have seen metrics indicating that 409 errors are properly being consumed internally, which should prevent them from being raised to applications. This fix will be rolled out fully today (Monday) and tomorrow; so far it has only been partially rolled out so that we can monitor the impact on applications. We expect the occurrence of 409s to have dropped since Friday and for them to be gone by Tuesday.
Thanks. Confirming that while yesterday I saw ~45,000, today I only saw two, early in the AM.
Update: the backend fix has been rolled out to 75% and we expect to crank it up to 100% if not today then this week. Also: as I mentioned, a PR is up with a proposed fix here: #2099 This fix is available as a pre-release as |
Socket mode receiver fails to make a successful connection due to a "WebSocket error occurred: Unexpected server response: 409" error. Retries after this error occurs may eventually lead to a successful connection or, worse, to rate limiting from Slack servers on the apps.connections.open call.
@slack/bolt version
4.0.1
Your App and Receiver Configuration
Socket mode 2.0.2
Node.js runtime version
v20.17.0
Steps to reproduce:
This error seems to reproduce only when launching in a distributed environment (i.e. launching 2 or more instances of the app).
Nothing special is being done on start up. It is minimally reproducible using the example from BoltJS Getting Started docs as a basis.
process.env.PORT to a different number in each shell
Expected result:
Each instance of the app should start up successfully without the "WebSocket error occurred: Unexpected server response: 409" error.
Actual result:
First we see the Socket Mode client successfully initialized.
Then we see the 409 error and the following logs repeat over and over again until:
Additional Notes
It seems that a successfully established connection does not mean the app is in a "safe" state. While running, I've observed both "Primary WebSocket error occurred: Unexpected server response: 409" and "Secondary WebSocket error occurred: Unexpected server response: 409" errors as well. The behavior is basically identical to that described above.
I've also observed this on @slack/bolt-js versions as early as 3.20.0.