Investigate uptick in 502 Bad Gateway responses (~0.001% requests) #446
Comments
We've made some changes (at around 3:10PM ET) that largely mitigate this. However, there are still some rare corner cases involving how POST/PUT/PATCH/DELETE requests are handled that we need to address, so there's still a remote possibility this can occur (basically, those non-safe requests can't be retried, so they're trickier to handle in the face of TCP keepalive connections closing). Based on the past hour of traffic, this has dropped to a handful of instances, representing ~0.001% of our traffic.

While the network changes triggering this were definitely unfortunate, if there's any silver lining, it's that I think this issue was actually happening before, just less frequently. At the low rate it was happening previously, we never detected it, and nobody ever reported issues to us, but I do believe we were experiencing the same issue for a very small fraction of our requests. Now that we've unearthed this, we can better address and test for it.

There are some further changes we can hopefully make to fix the last of these very rare issues (and some of the upcoming updates that we actually made the network changes in preparation for might also help address this). Once we get everything fully solved and tested, I can write up a more detailed postmortem of the issue and fixes.
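For context, a minimal nginx-style sketch of the retry asymmetry described above (API Umbrella's proxy layer is nginx-based; the upstream name "api_backends" and the values here are illustrative placeholders, not the actual production config). Requests that hit a keepalive connection the far side has already closed can be retried on another connection, while non-safe methods cannot be safely replayed:

```nginx
# Illustrative fragment only (hypothetical upstream name "api_backends").
location / {
    proxy_pass http://api_backends;

    # Retry only on connection errors/timeouts, never on application-level errors.
    proxy_next_upstream error timeout;
    proxy_next_upstream_tries 2;

    # Non-safe requests (POST/PUT/PATCH/DELETE) are the tricky case mentioned
    # above: blindly replaying them can apply the same change twice. nginx's
    # "non_idempotent" flag can force retries for methods it refuses to retry
    # by default, but that is generally unsafe.
    # proxy_next_upstream error timeout non_idempotent;
}
```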
- Tweak how the upstreams are set up to prevent temporary connection failures from removing the servers from rotation.
- Allow connection retries to upstreams in the event of connection failures.
- Enable so_keepalive on listening sockets (I don't necessarily think this will help with the upstream keepalive issues, but it's probably a good idea, and could help with keepalive behavior to any front-facing load balancers).

This cropped up after introducing an AWS NAT Gateway into our stack, which closes inactive keepalive connections after 5 minutes: https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-nat-gateway.html#nat-gateway-troubleshooting-timeout

See 18F/api.data.gov#446
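A rough sketch of what those three changes look like in nginx terms (the directive names are real nginx options, but the addresses and numbers are placeholders, not the actual api.data.gov configuration):

```nginx
# Sketch only; server addresses and values are illustrative.
upstream api_backends {
    # max_fails=0 keeps a backend in rotation even when some connection
    # attempts fail, instead of temporarily marking the server unavailable.
    server 10.0.1.10:8080 max_fails=0;
    server 10.0.1.11:8080 max_fails=0;

    # Maintain a pool of idle keepalive connections to the backends.
    keepalive 10;
}

server {
    # so_keepalive turns on TCP keepalive probes on the listening socket,
    # mainly useful for connections from front-facing load balancers.
    listen 80 so_keepalive=on;

    location / {
        proxy_pass http://api_backends;

        # Required for upstream keepalive connections to actually be reused.
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Retry connection-level failures against another upstream connection.
        proxy_next_upstream error timeout;
    }
}
```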
Should be addressed by the new infrastructure.
Since yesterday, there's been a small, but noticeable uptick in 502 Bad Gateway responses from API backends. Some level of 502 responses is normal (when the underlying API is actually down), but the uptick seems to be coming from APIs that are in fact up, so this is a problem at our layer.
By my estimates, this is affecting around 0.2%-0.25% of API traffic, so it's not super prevalent, but is still something we need to get fixed to prevent intermittent issues for API consumers.
I'm still investigating the root cause, but I believe this is somehow related to keepalive connections to API backends getting closed prematurely. We made some changes to how our AWS network environment is set up yesterday (using a NAT Gateway so we can better scale out the service while retaining our static egress IPs), so I believe the issue is related to that change, but I'm still getting to the bottom of it.
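To make the suspected mechanism concrete, here's a hypothetical sketch (placeholder address and values, not the real configuration): the proxy holds idle keepalive connections to the API backends, the NAT Gateway silently drops any connection idle longer than its timeout, and the next request that reuses a dropped connection fails and surfaces to the client as a 502.

```nginx
# Hypothetical sketch of the suspected failure mode.
upstream api_backends {
    server 10.0.1.10:8080;

    # Idle connections held here can sit unused longer than the NAT Gateway's
    # idle timeout; the gateway drops them silently, and the next request that
    # reuses one fails, which the proxy reports as a 502 Bad Gateway.
    keepalive 10;

    # Newer nginx releases (not available when this issue was filed) can cap the
    # idle lifetime below the gateway's window so stale connections are closed
    # by nginx itself rather than reused:
    # keepalive_timeout 60s;
}
```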