Investigate uptick in 502 Bad Gateway responses (~0.001% requests) #446

Closed
GUI opened this issue Jun 18, 2018 · 2 comments

@GUI
Contributor

GUI commented Jun 18, 2018

Since yesterday, there's been a small but noticeable uptick in 502 Bad Gateway responses from API backends. Some level of 502 responses is normal (when the underlying API is actually down), but the uptick seems to be coming from APIs that are in fact up, so this is a problem at our layer.

By my estimates, this is affecting around 0.2%-0.25% of API traffic, so it's not super prevalent, but it's still something we need to fix to prevent intermittent issues for API consumers.

I'm still investigating the root cause, but I believe this is somehow related to keepalive connections to API backends getting closed prematurely. Yesterday we made some changes to how our AWS network environment is set up (using a NAT Gateway so we can better scale out the service while retaining our static egress IPs), so I believe the issue is related to that change, but I'm still getting to the bottom of it.
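For context on how the keepalive connections come into play, here's a minimal sketch of a typical nginx upstream keepalive setup (API Umbrella is nginx-based, but the addresses and pool size here are placeholders, not our actual generated config). The NAT Gateway times out idle connections without sending a FIN, so the proxy only discovers a connection is dead when it next tries to reuse it, which can surface as a 502 if the request isn't retried.

```nginx
# Illustrative upstream keepalive setup (placeholder addresses and sizes,
# not API Umbrella's actual generated config).
upstream api_backend {
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;

    # Idle connections kept open to the backends for reuse. If the NAT
    # Gateway silently times one of these out, the failure only shows up
    # on the next request that tries to reuse it.
    keepalive 10;
}

server {
    listen 80;

    location / {
        # HTTP/1.1 and a cleared Connection header are required for
        # upstream keepalive to work at all.
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_pass http://api_backend;
    }
}
```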

@GUI
Contributor Author

GUI commented Jun 19, 2018

We've made some changes (at around 3:10 PM ET) that largely mitigate this. However, there are still some rare corner cases involving how POST/PUT/PATCH/DELETE requests are handled that we need to address, so there's still a remote possibility this can occur (basically, those non-safe requests can't be retried, so they're trickier to handle when TCP keepalive connections close). Based on the past hour of traffic, this has dropped to a handful of instances, representing ~0.001% of our traffic.
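To illustrate why the non-safe methods are the tricky part: nginx's retry mechanism (proxy_next_upstream) will not, by default, pass requests with non-idempotent methods (POST, LOCK, PATCH) to the next server, precisely because the backend may have already acted on them. A hedged sketch of the kind of retry settings involved, continuing the upstream sketch above (values are illustrative, not our exact config):

```nginx
location / {
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_pass http://api_backend;

    # Retry connection-level failures against another upstream server.
    # Safe/idempotent requests can simply be replayed; requests with
    # non-idempotent methods (POST, LOCK, PATCH) are not passed to the
    # next server by default, so a stale keepalive connection can still
    # surface as a 502 for them.
    proxy_next_upstream error timeout;
    proxy_next_upstream_tries 2;
}
```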

While the network changes triggering this were definitely unfortunate, if there's any silver lining, it's that I think this issue was actually happening before, just less frequently. At the low rate it was happening previously, we never detected it, and nobody ever reported issues to us, but I do believe we were experiencing the same issue for a very small fraction of our requests. Now that we've unearthed it, we can better address and test for it.

There are some further changes we can hopefully make to fix the last of these very rare issues (and some of the upcoming updates that we actually made the network changes in preparation for might also help address this). Once we get everything fully solved and tested, I can write up a more detailed postmortem of the issue and fixes.

GUI added a commit to NREL/api-umbrella that referenced this issue Jun 19, 2018
- Tweak how the upstreams are setup to prevent temporary connection
  failures from removing the servers from rotation.
- Allow connection retries to upstreams in the event of connection
  failures.
- Enable so_keepalive on listening sockets (I don't necessarily think
  this will help with the upstream keepalive issues, but it's probably a
  good idea, and could help with keepalive behavior to any front-facing
  load balancers).

This cropped up after introducing an AWS NAT Gateway into our stack,
which closes inactive keepalive connections after 5 minutes:
https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-nat-gateway.html#nat-gateway-troubleshooting-timeout

See 18F/api.data.gov#446
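For anyone following along, here's a rough nginx-level sketch of the first and third changes described in that commit (the retry change is along the lines of the proxy_next_upstream sketch earlier in the thread). API Umbrella generates its nginx config from templates, so the addresses and values below are placeholders, not the actual diff:

```nginx
upstream api_backend {
    # max_fails=0 disables the passive failure counting that would
    # otherwise take a server out of rotation after transient
    # connection failures.
    server 10.0.1.10:8080 max_fails=0;
    server 10.0.1.11:8080 max_fails=0;
    keepalive 10;
}

server {
    # so_keepalive=on enables TCP keepalives on the listening socket;
    # as noted in the commit, this is aimed more at keepalive behavior
    # toward any front-facing load balancers than at the upstream issue.
    listen 80 so_keepalive=on;
}
```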
GUI changed the title from "Investigate uptick in 502 Bad Gateway responses (~0.2% requests)" to "Investigate uptick in 502 Bad Gateway responses (~0.001% requests)" on Jun 20, 2018
@gbinal
Contributor

gbinal commented Oct 30, 2018

Should be addressed by the new infrastructure.

gbinal closed this as completed on Oct 30, 2018