Investigate uptick in 502 Bad Gateway responses (~0.001% requests) #446
Comments
We've made some changes (at around 3:10PM ET) that largely mitigate this. However, there are still some rare corner cases involving how POST/PUT/PATCH/DELETE requests are handled that we need to address, so there's still a remote possibility this can occur (basically, those non-safe requests can't be retried, so they're trickier to handle in the face of TCP keepalive connections closing). Based on the past hour of traffic, this has dropped to a handful of instances, representing ~0.001% of our traffic.

While the network changes triggering this were definitely unfortunate, if there's any silver lining, it's that I think this issue was actually happening before, just less frequently. At the low rate it was happening previously, we never detected it, and nobody ever reported issues to us, but I do believe we were experiencing the same issue for a very small fraction of our requests. Now that we've unearthed this, we can better address and test for it.

There are some further changes we can hopefully make to fix the last of these very rare issues (and some of the upcoming updates that we actually made the network changes in preparation for might also help address this). Once we get everything fully solved and tested, I can write up a more detailed postmortem of the issue and fixes.
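For context, a minimal nginx-style sketch of the retry asymmetry described above (API Umbrella's proxy layer is nginx-based; the upstream name "api_backends" and the values here are illustrative placeholders, not the actual production config). Requests that hit a keepalive connection the far side has already closed can be retried on another connection, while non-safe methods cannot be safely replayed:

```nginx
# Illustrative fragment only (hypothetical upstream name "api_backends").
location / {
    proxy_pass http://api_backends;

    # Retry only on connection errors/timeouts, never on application-level errors.
    proxy_next_upstream error timeout;
    proxy_next_upstream_tries 2;

    # Non-safe requests (POST/PUT/PATCH/DELETE) are the tricky case mentioned
    # above: blindly replaying them can apply the same change twice. nginx's
    # "non_idempotent" flag can force retries for methods it refuses to retry
    # by default, but that is generally unsafe.
    # proxy_next_upstream error timeout non_idempotent;
}
```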
- Tweak how the upstreams are set up to prevent temporary connection failures from removing the servers from rotation.
- Allow connection retries to upstreams in the event of connection failures.
- Enable so_keepalive on listening sockets (I don't necessarily think this will help with the upstream keepalive issues, but it's probably a good idea, and could help with keepalive behavior to any front-facing load balancers).

This cropped up after introducing an AWS NAT Gateway into our stack, which closes inactive keepalive connections after 5 minutes: https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-nat-gateway.html#nat-gateway-troubleshooting-timeout

See 18F/api.data.gov#446
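A rough sketch of what those three changes look like in nginx terms (the directive names are real nginx options, but the addresses and numbers are placeholders, not the actual api.data.gov configuration):

```nginx
# Sketch only; server addresses and values are illustrative.
upstream api_backends {
    # max_fails=0 keeps a backend in rotation even when some connection
    # attempts fail, instead of temporarily marking the server unavailable.
    server 10.0.1.10:8080 max_fails=0;
    server 10.0.1.11:8080 max_fails=0;

    # Maintain a pool of idle keepalive connections to the backends.
    keepalive 10;
}

server {
    # so_keepalive turns on TCP keepalive probes on the listening socket,
    # mainly useful for connections from front-facing load balancers.
    listen 80 so_keepalive=on;

    location / {
        proxy_pass http://api_backends;

        # Required for upstream keepalive connections to actually be reused.
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Retry connection-level failures against another upstream connection.
        proxy_next_upstream error timeout;
    }
}
```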
Should be addressed by the new infrastructure.
Since yesterday, there's been a small, but noticeable uptick in 502 Bad Gateway responses from API backends. Some level of 502 responses is normal (when the underlying API is actually down), but the uptick seems to be coming from APIs that are in fact up, so this is a problem at our layer.
By my estimates, this is affecting around 0.2%-0.25% of API traffic, so it's not super prevalent, but is still something we need to get fixed to prevent intermittent issues for API consumers.
I'm still investigating the root cause, but I believe this is somehow related to keepalive connections to API backends getting closed prematurely. We made some changes to how our AWS network environment is set up yesterday (using a NAT Gateway so we can better scale out the service while retaining our static egress IPs), so I believe the issue is related to that change, but I'm still getting to the bottom of it.
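To make the suspected mechanism concrete, here's a hypothetical sketch (placeholder address and values, not the real configuration): the proxy holds idle keepalive connections to the API backends, the NAT Gateway silently drops any connection idle longer than its timeout, and the next request that reuses a dropped connection fails and surfaces to the client as a 502.

```nginx
# Hypothetical sketch of the suspected failure mode.
upstream api_backends {
    server 10.0.1.10:8080;

    # Idle connections held here can sit unused longer than the NAT Gateway's
    # idle timeout; the gateway drops them silently, and the next request that
    # reuses one fails, which the proxy reports as a 502 Bad Gateway.
    keepalive 10;

    # Newer nginx releases (not available when this issue was filed) can cap the
    # idle lifetime below the gateway's window so stale connections are closed
    # by nginx itself rather than reused:
    # keepalive_timeout 60s;
}
```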