boskos/janitor: track when cleanup fails repeatedly for the same resource #15866

Closed
ixdy opened this issue Jan 10, 2020 · 8 comments
Labels
area/boskos Issues or PRs related to code in /boskos kind/feature Categorizes issue or PR as related to a new feature.

Comments

@ixdy
Member

ixdy commented Jan 10, 2020

Due to programming errors, the janitor may continuously fail to clean up a resource. Two examples I just discovered:

possibly an order-of-deletion issue:

{"error":"exit status 1","level":"info","msg":"failed to clean up project kube-gke-upg-1-2-1-3-upg-clu-n, error info: Activated service account credentials for: [[email protected]]\nERROR: (gcloud.compute.networks.delete) Could not fetch resource:\n - The network resource 'projects/kube-gke-upg-1-2-1-3-upg-clu-n/global/networks/jenkins-e2e' is already being used by 'projects/kube-gke-upg-1-2-1-3-upg-clu-n/global/routes/default-route-92807148d5aa60d1'\n\nError try to delete resources networks: CalledProcessError()\n[=== Start Janitor on project 'kube-gke-upg-1-2-1-3-upg-clu-n' ===]\n[=== Activating service_account /etc/service-account/service-account.json ===]\n[=== Finish Janitor on project 'kube-gke-upg-1-2-1-3-upg-clu-n' with status 1 ===]\n","time":"2020-01-10T21:03:14Z"}

likely incorrect flags (gcloud changed but we didn't?):

{"error":"exit status 1","level":"info","msg":"failed to clean up project k8s-jkns-e2e-gke-ci-canary, error info: Activated service account credentials for: [[email protected]]\nERROR: (gcloud.compute.disks.delete) unrecognized arguments: --global \n\nTo search the help text of gcloud commands, run:\n  gcloud help -- SEARCH_TERMS\nError try to delete resources disks: CalledProcessError()\nERROR: (gcloud.compute.disks.delete) unrecognized arguments: --region=https://www.googleapis.com/compute/v1/projects/k8s-jkns-e2e-gke-ci-canary/regions/us-central1 \n\nTo search the help text of gcloud commands, run:\n  gcloud help -- SEARCH_TERMS\nError try to delete resources disks: CalledProcessError()\n[=== Start Janitor on project 'k8s-jkns-e2e-gke-ci-canary' ===]\n[=== Activating service_account /etc/service-account/service-account.json ===]\n[=== Finish Janitor on project 'k8s-jkns-e2e-gke-ci-canary' with status 1 ===]\n","time":"2020-01-10T21:18:55Z"}

It'd be good to have some way of detecting when we're repeatedly failing to clean up a resource.
Not sure yet what the best way would be to track that.
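One low-tech way to spot repeat offenders is to count failures per project straight from the janitor's JSON log lines. The following is only a sketch: it assumes the logs are newline-delimited JSON with a `msg` field shaped like the two examples above, and the names `logEntry`, `failedProject`, and `countFailures` are hypothetical helpers, not part of Boskos.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// logEntry mirrors the fields visible in the janitor log lines quoted above.
type logEntry struct {
	Error string `json:"error"`
	Level string `json:"level"`
	Msg   string `json:"msg"`
	Time  string `json:"time"`
}

// failedProject extracts the project name from a message that starts with
// "failed to clean up project <name>,". The pattern is an assumption based
// on the two log lines in this issue.
var failedProject = regexp.MustCompile(`^failed to clean up project ([^,\s]+),`)

// countFailures scans newline-delimited JSON log lines and returns the
// number of cleanup failures seen per project. Lines that are not valid
// JSON are skipped.
func countFailures(logs string) map[string]int {
	counts := make(map[string]int)
	scanner := bufio.NewScanner(strings.NewReader(logs))
	// Janitor messages embed full gcloud output, so allow long lines.
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for scanner.Scan() {
		var e logEntry
		if err := json.Unmarshal(scanner.Bytes(), &e); err != nil {
			continue
		}
		if m := failedProject.FindStringSubmatch(e.Msg); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	// Made-up log lines in the same shape as the real ones.
	logs := `{"error":"exit status 1","level":"info","msg":"failed to clean up project example-project-a, error info: ...","time":"2020-01-10T21:03:14Z"}
{"error":"exit status 1","level":"info","msg":"failed to clean up project example-project-a, error info: ...","time":"2020-01-10T21:13:14Z"}
{"level":"info","msg":"some unrelated message","time":"2020-01-10T21:14:00Z"}`

	const threshold = 2 // arbitrary cutoff for "repeatedly failing"
	for project, n := range countFailures(logs) {
		if n >= threshold {
			fmt.Printf("%s failed cleanup %d times\n", project, n)
		}
	}
}
```

This only works after the fact and only for whoever can read the logs, so it is a stopgap rather than real tracking.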

@ixdy added the kind/feature label Jan 10, 2020
@ixdy
Member Author

ixdy commented Jan 10, 2020

/area boskos

@k8s-ci-robot added the area/boskos label Jan 10, 2020
@dims
Member

dims commented Jan 11, 2020

@ixdy it would also help if we could publish the Boskos logs somewhere public.

@ixdy
Member Author

ixdy commented Jan 11, 2020

@dims I'm not sure where we'd publish them, and I'm also not sure we've done a great job of sanitizing the logs yet (to ensure that they're not leaking any sensitive information). The logs are visible to anyone maintaining the prow cluster, though in some cases they lack useful information.

Regarding tracking cleanup failures, I have a few potential ideas:

  1. If cleanup fails, set metadata on the resource indicating how many cleanup attempts have occurred. Maybe we could use userData rather than adding a new field? If cleanup succeeds, we could clear this information.
  2. Add metrics to the janitor (and start collecting them) tracking the number of successful or failed cleanup attempts, possibly segmented by resource type. This wouldn't tell us if a single resource was repeatedly failing, but a change in rate would be useful for detecting and triaging an issue like "Boskos is running out of GKE Projects" (#15860) before all resources have been exhausted.
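To make idea 1 concrete, here is a minimal sketch of keeping a failure counter in a resource's user data. Everything here is an assumption for illustration: `resource` is a simplified stand-in with a plain `map[string]string` rather than the real Boskos resource type, and the key name `cleanup-failures` and the helper `recordCleanupResult` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// failureCountKey is a hypothetical user-data key for the attempt counter.
const failureCountKey = "cleanup-failures"

// errExample stands in for whatever error the janitor gets back.
var errExample = errors.New("exit status 1")

// resource is a simplified stand-in, not the real Boskos resource type.
type resource struct {
	Name     string
	UserData map[string]string
}

// recordCleanupResult updates the failure counter stored on the resource.
// On success the counter is cleared; on failure it is incremented.
// It returns the number of consecutive failures recorded so far.
func recordCleanupResult(r *resource, cleanupErr error) int {
	if cleanupErr == nil {
		delete(r.UserData, failureCountKey) // no-op if the map is nil
		return 0
	}
	if r.UserData == nil {
		r.UserData = make(map[string]string)
	}
	n, err := strconv.Atoi(r.UserData[failureCountKey])
	if err != nil {
		n = 0 // missing or malformed counter: start over
	}
	n++
	r.UserData[failureCountKey] = strconv.Itoa(n)
	return n
}

func main() {
	r := &resource{Name: "example-project"}
	const alertAfter = 3 // arbitrary threshold for this sketch

	for attempt := 1; attempt <= 3; attempt++ {
		if n := recordCleanupResult(r, errExample); n >= alertAfter {
			fmt.Printf("%s has failed cleanup %d times in a row\n", r.Name, n)
		}
	}

	// A later successful cleanup clears the counter again.
	recordCleanupResult(r, nil)
	fmt.Printf("counter present after success: %v\n", r.UserData[failureCountKey] != "")
}
```

Storing the count as a string keeps it compatible with a string-keyed user-data map; whether the real userData is the right home for it is the open question raised above.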

@ixdy
Member Author

ixdy commented Jan 11, 2020

Option 2 is closely aligned with #14715.
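As a rough illustration of option 2, the counters could look something like the sketch below. It uses only the standard library so that it runs on its own; a real janitor would more likely export these through a metrics library. The type `cleanupStats`, its methods, and the resource type string are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// cleanupStats counts cleanup outcomes per resource type.
// It is a standard-library stand-in for real exported metrics.
type cleanupStats struct {
	mu        sync.Mutex
	succeeded map[string]int
	failed    map[string]int
}

func newCleanupStats() *cleanupStats {
	return &cleanupStats{
		succeeded: make(map[string]int),
		failed:    make(map[string]int),
	}
}

// observe records one cleanup attempt for the given resource type.
func (s *cleanupStats) observe(resourceType string, ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if ok {
		s.succeeded[resourceType]++
	} else {
		s.failed[resourceType]++
	}
}

// failureRate returns failed/(failed+succeeded) for a resource type,
// or 0 if no attempts have been recorded.
func (s *cleanupStats) failureRate(resourceType string) float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	total := s.succeeded[resourceType] + s.failed[resourceType]
	if total == 0 {
		return 0
	}
	return float64(s.failed[resourceType]) / float64(total)
}

func main() {
	stats := newCleanupStats()
	const resourceType = "example-type" // placeholder resource type

	stats.observe(resourceType, true)
	stats.observe(resourceType, false)
	stats.observe(resourceType, false)
	stats.observe(resourceType, false)

	fmt.Printf("failure rate for %s: %.2f\n", resourceType, stats.failureRate(resourceType))
}
```

As noted above, this would not identify a single stuck resource, but a jump in the failure rate for one resource type would be visible before the pool is exhausted.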

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 10, 2020
@ixdy
Member Author

ixdy commented Apr 10, 2020

/remove-lifecycle stale

I still want to do this.

@k8s-ci-robot removed the lifecycle/stale label Apr 10, 2020
@ixdy
Member Author

ixdy commented May 29, 2020

Moving to kubernetes-sigs/boskos#15.
/close

@k8s-ci-robot
Contributor

@ixdy: Closing this issue.

