Boskos is running out of GKE Projects #15860
Comments
/kind oncall-hotlist
Also of note, this does not appear to be related to changes in Boskos (given dates of last changes to code) or Prow rollouts.
/area boskos
Does look like the janitor is in fact encountering issues when deleting resources. Investigation continues.
The errors the janitor was encountering on cleanup have been resolved, and it looks like the janitors cleaned up GKE projects well over the weekend! Will look into re-enabling the GKE tests (or, alternatively, deleting if they're not actually needed). Leaving on the oncall hotlist until the jobs have been resolved, but on the Boskos side this is fixed now.
Jobs are re-enabled and Boskos appears to be handling them alright. Will continue monitoring, but for now I think this can be closed.
/cc @Katharine who did most of the investigation/mitigation, thanks!
Boskos graphs: https://monitoring.prow.k8s.io/d/wSrfvNxWz/boskos-resource-usage?orgId=1
Yesterday, alerts showed that Boskos resources were starting to run out, with a growing number of resources in the 'dirty' state. Adding more Boskos janitors seemed to increase the rate at which resources were cleaned up, which solved the problem for most resource types. GKE projects, however, still had a high number of projects in the 'cleaning' or 'dirty' states, though the numbers seemed to be stable.
Overnight, the number of 'dirty' projects increased steadily, and we are now nearly out of projects. Janitors are cleaning up projects (projects are moving to the 'free' state on a regular cadence), so it doesn't seem that project cleanup is failing.
The current mitigation is to disable most of the periodic jobs running on GKE to relieve pressure on the Boskos janitors (see #15854 and #15857). We're starting to see an increase in 'free' projects and will continue monitoring.