
A variety of network calls from GoCD seem less reliable #404

Closed
robrap opened this issue Aug 17, 2023 · 3 comments
Labels: escalate-to-psre (Create a PSRE ticket for this issue), esre

robrap (Contributor) commented Aug 17, 2023

Context:

The symptom we are seeing is that GoCD now raises many more OpsGenie alerts for the edxapp pipeline than it did in the past. We have OpsGenie configured to alert the team only if a failure happens twice, and most alerts self-close. That pattern usually means the failure was flaky, and flaky failures are often caused by networking issues.

Recently we've been adding retries around transient failures, such as failed downloads.
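For context on the kind of retry being described, here is a minimal sketch of a retry-with-exponential-backoff wrapper around a download step. It assumes a Python step in the pipeline; `with_retries` and `download` are hypothetical names for illustration and are not taken from the edxapp pipeline or GoCD.

```python
import time
import urllib.request
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
) -> T:
    """Call fn(), retrying on transient errors with exponential backoff (hypothetical helper)."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the pipeline
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


def download(url: str, dest: str) -> None:
    """Hypothetical download step. urlopen raises URLError (an OSError subclass) on network failures."""
    with urllib.request.urlopen(url, timeout=30) as resp, open(dest, "wb") as out:
        out.write(resp.read())


# Usage (hypothetical URL and path):
# with_retries(lambda: download("https://example.com/artifact.tar.gz", "/tmp/artifact.tar.gz"))
```

Retrying only on `OSError` keeps genuine bugs (for example a `TypeError`) failing immediately, while network errors get a few more chances before the stage is marked failed.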

We have other transient failures ticketed:

And still others that have yet to be ticketed.

See GoCD alerts graph:
[Image: GoCD alerts graph]

Question:

  • Is it possible that the NAT has reached its capacity for outbound requests, or that we have hit some other infrastructure limit?
robrap added this to Arch-BOM Aug 17, 2023
robrap converted this from a draft issue Aug 17, 2023
robrap added the escalate-to-psre (Create a PSRE ticket for this issue) and esre labels Aug 17, 2023
robrap (Contributor, Author) commented Aug 17, 2023

[inform] These were the tags I used for OpsGenie Analytics:
[Image: OpsGenie Analytics tags]

robrap changed the title "A variety of network calls seem less reliable" to "A variety of network calls from GoCD seem less reliable" Aug 18, 2023

robrap (Contributor, Author) commented Aug 25, 2023

We've had enough success with retries for now that we've decided not to investigate whether there is an underlying issue. I'm closing this ticket.

robrap closed this as completed Aug 25, 2023
github-project-automation bot moved this to Done in Arch-BOM Aug 25, 2023
Projects
Archived in project