Boskos is running out of GKE Projects #15860

Closed
michelle192837 opened this issue Jan 10, 2020 · 6 comments
Labels
area/boskos: Issues or PRs related to code in /boskos
kind/bug: Categorizes issue or PR as related to a bug.
kind/oncall-hotlist: Categorizes issue or PR as tracked by test-infra oncall.

Comments

@michelle192837
Contributor

/cc @Katharine who did most of the investigation/mitigation, thanks!

Boskos graphs: https://monitoring.prow.k8s.io/d/wSrfvNxWz/boskos-resource-usage?orgId=1

Yesterday, alerts showed that Boskos resources were starting to run out, with a growing number of resources in the 'dirty' state. Increasing the number of Boskos janitors seemed to raise the rate at which resources were cleaned up, which solved the problem for most resource types. GKE projects were the exception: a high number of them remained in the 'cleaning' or 'dirty' states, though the counts appeared stable.

Overnight, the number of 'dirty' projects increased steadily, and we are now nearly out of projects. Janitors are still cleaning up projects (projects are moving to the 'free' state on a regular cadence), so it doesn't seem that project cleanup is failing.

Mitigation at the moment is disabling most of the periodic jobs running on GKE to alleviate pressure on the Boskos janitors (see #15854 and #15857). We're starting to see an increase in 'free' projects and will continue monitoring.
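For anyone trying to reproduce this kind of reading from raw data rather than the dashboard, here is a minimal sketch of how the pattern above (the 'dirty' count rising while projects still reach 'free') could be checked. It assumes each snapshot is a list of dictionaries with `type` and `state` keys; that shape, the function names, and the `gke-project` type string used in the example are illustrative assumptions, not taken from the Boskos code.

```python
from collections import Counter


def count_states(resources, resource_type):
    """Tally resource states for one resource type in a single snapshot."""
    return Counter(r["state"] for r in resources if r["type"] == resource_type)


def backlog_trend(snapshots, resource_type):
    """Return the change in 'dirty' and 'free' counts between consecutive snapshots.

    A run of positive 'dirty' deltas means resources are being dirtied
    faster than the janitors clean them, even if cleanup is still succeeding.
    """
    counts = [count_states(s, resource_type) for s in snapshots]
    return [
        {"dirty": after["dirty"] - before["dirty"],
         "free": after["free"] - before["free"]}
        for before, after in zip(counts, counts[1:])
    ]


# Hypothetical snapshots, oldest first (not real data from this incident).
earlier = [
    {"type": "gke-project", "state": "free"},
    {"type": "gke-project", "state": "free"},
    {"type": "gke-project", "state": "dirty"},
]
later = [
    {"type": "gke-project", "state": "free"},
    {"type": "gke-project", "state": "dirty"},
    {"type": "gke-project", "state": "dirty"},
]
trend = backlog_trend([earlier, later], "gke-project")
```

Comparing deltas instead of absolute counts separates "cleanup is failing" from "cleanup is slower than demand", which is the distinction being drawn above.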

michelle192837 added the kind/bug label Jan 10, 2020
@michelle192837
Contributor Author

/kind oncall-hotlist

Also of note, this does not appear to be related to changes in Boskos (given dates of last changes to code) or Prow rollouts.

k8s-ci-robot added the kind/oncall-hotlist label Jan 10, 2020
@ixdy
Member

ixdy commented Jan 10, 2020

/area boskos

k8s-ci-robot added the area/boskos label Jan 10, 2020
@michelle192837
Contributor Author

It does look like the janitor is in fact encountering issues when deleting resources. Investigation continues.

@michelle192837
Contributor Author

The errors the janitor was encountering on cleanup have been resolved, and it looks like the janitors cleaned up GKE projects well over the weekend! Will look into re-enabling the GKE tests (or, alternatively, deleting them if they're not actually needed).

Leaving this on the oncall hotlist until the disabled jobs have been dealt with, but on the Boskos side this is fixed now.

@michelle192837
Contributor Author

GKE jobs are being re-enabled (#15882 and #15883).

@michelle192837
Contributor Author

Jobs are re-enabled and Boskos appears to be handling them alright. Will continue monitoring, but for now I think this can be closed.


3 participants