Boskos seems to be wedged #186

Closed
bobcatfish opened this issue Jan 18, 2020 · 11 comments
Labels
area/boskos Issues or PRs related to code in /boskos area/test-infra Issues or PRs related to the testing infrastructure kind/bug Categorizes issue or PR as related to a bug.

Comments

@bobcatfish
Contributor

bobcatfish commented Jan 18, 2020

Expected Behavior

Boskos should clean up projects once they are done being used and make them available for future use.

Actual Behavior

tektoncd/pipeline#1541 and tektoncd/pipeline#1888 both have consistently failing integration tests with an error like:

I0117 21:16:30.477] 2020/01/17 21:16:30 main.go:734: provider gke, will acquire project type gke-project from boskos
I0117 21:21:30.475] 2020/01/17 21:21:30 main.go:316: Something went wrong: failed to prepare test environment: --provider=gke boskos failed to acquire project: resources not found

In #29 and other times in the past we have responded to this error by provisioning more projects for boskos.

This time though it's definitely not the case that all the projects are in use:

When I look at the logs from the boskos Janitor I see this kind of error:

 msg: "failed to clean up project tekton-prow-10, error info: Activated service account credentials for: [[email protected]]
ERROR: (gcloud.compute.instances.list) Some requests did not succeed:
 - Invalid value for field 'zone': 'asia-northeast3-a'. Unknown zone.
 - Invalid value for field 'zone': 'asia-northeast3-b'. Unknown zone.
 - Invalid value for field 'zone': 'asia-northeast3-c'. Unknown zone.

Fail to list resource 'instances' from project 'tekton-prow-10'
ERROR: (gcloud.compute.disks.delete) unrecognized arguments: --global 

To search the help text of gcloud commands, run:
  gcloud help -- SEARCH_TERMS
Error try to delete resources: CalledProcessError()
ERROR: (gcloud.container.clusters.list) ResponseError: code=404, message=Not Found.
[=== Start Janitor on project 'tekton-prow-10' ===]
[=== Activating service_account /etc/test-account/service-account.json ===]
[=== Finish Janitor on project 'tekton-prow-10' with status 1 ===]

I think the gcloud error might be a red herring, maybe a state that boskos gets into after some other kind of error first.

CPU and memory usage for both boskos and the boskos janitor started going up a few hours ago, but it's hard to say whether that is causing the problem or the problem is causing it:

image

Also, this particular janitor pod has been steadily using more and more memory (interestingly, this one was started on Jan 6, but the other two janitor pods had been around since about May):

image

The other 2 janitor pods look like:

image

Additional Info

I couldn't find any other quotas that seemed like they needed increasing. I think there's a good chance that boskos got into a bad state and just restarting everything will fix it.

Coincidentally, there was a (seemingly unrelated?) GCP outage at the time when these errors started: https://status.cloud.google.com/incident/zall/20001 So maybe that put things into a bad state.

It's also possible that this is because we're using such an old version of boskos and it might need an update. I think there's a good chance that updating boskos will solve the whole thing, but I didn't want to rush to do that since we might run into other problems.

bobcatfish added a commit to bobcatfish/plumbing that referenced this issue Jan 18, 2020
Trying to address tektoncd#186 but it
does nothing
@bobcatfish
Contributor Author

I tried some things but nothing has worked:

@bobcatfish
Contributor Author

I think the next thing to try is to update the boskos images to something newer: maybe the reason that outage + this started at the same time was that there was a rollout of something that is no longer compatible with our ancient boskos images (and their gcloud install?)

@afrittoli
Member

Thank you for the detailed analysis!
I might be able to try and update boskos in the Prow cluster later today, unless someone else can do that before.

@afrittoli
Member

The issue still seems to be there, since I can still see a lot of failures in https://tekton-releases.appspot.com/builds/tekton-prow/pr-logs/directory/pull-tekton-pipeline-integration-tests.
I tried a /retest on one PR, though, and it was able to get a cluster from Boskos.
Checking boskos logs, this shows up continuously:

{\"type\":\"gke-project\",\"name\":\"tekton-prow-13\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:25.864824733Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-5\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:26.78376571Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-12\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:28.135215444Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-7\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:28.455754565Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-11\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:29.393499993Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-1\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:31.945974382Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-3\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:35.84089279Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-4\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:35.926420242Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-2\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:38.518577986Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-9\",\"state\":\"free\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T09:37:38.586946232Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-6\",\"state\":\"busy\",\"owner\":\"pull-tekton-pipeline-integration-tests\",\"lastupdate\":\"2020-01-18T10:42:54.252574171Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-14\",\"state\":\"dirty\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T10:43:23.243993663Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-10\",\"state\":\"dirty\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T10:43:27.643677059Z\",\"userdata\":{}},
{\"type\":\"gke-project\",\"name\":\"tekton-prow-0\",\"state\":\"dirty\",\"owner\":\"\",\"lastupdate\":\"2020-01-18T10:43:30.207256322Z\",\"userdata\":{}}]"

The busy project is the one used for my retest, but there are three projects that are dirty with no owner, that boskos keeps trying to reset, but with no luck.
I tried deleting the boskos-reaper pod, which was 254d old, but it doesn't seem to help.
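
A quick way to spot this state is to summarize the resource list by state and pick out the entries that are dirty with no owner. This is only a minimal sketch, assuming the list has first been unescaped into plain JSON with the same fields as in the log above (`type`, `name`, `state`, `owner`, `lastupdate`). The helper names `count_by_state` and `stuck_resources` are hypothetical and not part of boskos, and the sample holds just three of the entries shown above.

```python
import json
from collections import Counter

# Three entries copied from the log above, with the \" escaping removed.
SAMPLE = """[
{"type":"gke-project","name":"tekton-prow-13","state":"free","owner":"","lastupdate":"2020-01-18T09:37:25.864824733Z","userdata":{}},
{"type":"gke-project","name":"tekton-prow-6","state":"busy","owner":"pull-tekton-pipeline-integration-tests","lastupdate":"2020-01-18T10:42:54.252574171Z","userdata":{}},
{"type":"gke-project","name":"tekton-prow-10","state":"dirty","owner":"","lastupdate":"2020-01-18T10:43:27.643677059Z","userdata":{}}
]"""


def count_by_state(resources):
    """Count how many resources are in each state (free, busy, dirty, ...)."""
    return Counter(r["state"] for r in resources)


def stuck_resources(resources):
    """Return the names of resources that are dirty and have no owner."""
    return sorted(
        r["name"] for r in resources if r["state"] == "dirty" and not r["owner"]
    )


resources = json.loads(SAMPLE)
print(dict(count_by_state(resources)))
print(stuck_resources(resources))
```

If the same names keep showing up as dirty across several dumps while `lastupdate` keeps moving, that would be consistent with the janitor repeatedly failing to clean them.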

For tekton-prow-14 specifically, something seems to be wrong with the setup:

ERROR: (gcloud.compute.sole-tenancy.node-templates.list) HTTPError 403: Access Not Configured. Compute Engine API has not been used in project [censored] before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=[censored] then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.\nFail to list resource 'sole-tenancy' from project 'tekton-prow-14'

@vdemeester
Member

/area boskos
/kind bug
/area test-infra

@tekton-robot tekton-robot added area/boskos Issues or PRs related to code in /boskos kind/bug Categorizes issue or PR as related to a bug. area/test-infra Issues or PRs related to the testing infrastructure labels Jan 20, 2020
@afrittoli
Member

I tried updating boskos to the latest image available, v20190621-ff01381, but the error in the janitor logs persists:

{"error":"exit status 1","level":"error","msg":"failed to clean up project tekton-prow-10, error info: Activated service account credentials for: [[email protected]]\nERROR: (gcloud.compute.disks.delete) unrecognized arguments: --global \n\nTo search the help text of gcloud commands, run:\n  gcloud help -- SEARCH_TERMS\nError try to delete resources disks: CalledProcessError()\nERROR: (gcloud.container.clusters.list) ResponseError: code=404, message=Not Found.\n[=== Start Janitor on project 'tekton-prow-10' ===]\n[=== Activating service_account /etc/test-account/service-account.json ===]\n[=== Finish Janitor on project 'tekton-prow-10' with status 1 ===]\n","time":"2020-01-20T11:02:46Z"}
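
Since the janitor entry is a single JSON object with the gcloud output folded into `msg`, it can help to pull out just the project name and the gcloud commands that reported an error. A rough sketch, assuming the entry shape shown above; `gcloud_errors` is a hypothetical helper, and `LOG_LINE` is an abridged copy of the entry above (the credentials line and a few informational lines are left out).

```python
import json
import re

# Abridged copy of the janitor log entry above (raw string keeps the \n escapes for JSON).
LOG_LINE = r"""{"error":"exit status 1","level":"error","msg":"failed to clean up project tekton-prow-10, error info: ERROR: (gcloud.compute.disks.delete) unrecognized arguments: --global \nError try to delete resources disks: CalledProcessError()\nERROR: (gcloud.container.clusters.list) ResponseError: code=404, message=Not Found.\n[=== Finish Janitor on project 'tekton-prow-10' with status 1 ===]\n","time":"2020-01-20T11:02:46Z"}"""


def gcloud_errors(log_line):
    """Return (project, [gcloud commands that printed an ERROR]) for one log entry."""
    entry = json.loads(log_line)
    msg = entry["msg"]
    project = re.search(r"failed to clean up project (\S+),", msg)
    commands = re.findall(r"ERROR: \((gcloud\.[\w.-]+)\)", msg)
    return (project.group(1) if project else None), commands


print(gcloud_errors(LOG_LINE))
```

Grouping entries this way makes it easier to see which gcloud command fails for which project, for example `gcloud.compute.disks.delete` for tekton-prow-10.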

@afrittoli
Member

Looking at tekton-prow-10 via the console, there is no k8s cluster in the project, but there are two PVC backing disks left around:

image

I don't have permissions to delete them manually, but doing so might unblock the project until we sort out the issue on the boskos side.
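
For whoever does have the permissions, the manual cleanup could look roughly like the sketch below. It only builds the commands and does not run them. It assumes the input is the parsed output of `gcloud compute disks list --project=<project> --format=json`, where each zonal disk has a `name` and a `zone` URL. The function `delete_disk_commands` and the disk name used in the example are hypothetical, and this is not how the janitor itself deletes disks.

```python
def delete_disk_commands(project, disks):
    """Build one `gcloud compute disks delete` command per zonal disk.

    `disks` is assumed to be the parsed JSON from
    `gcloud compute disks list --project=<project> --format=json`.
    """
    commands = []
    for disk in disks:
        zone_url = disk.get("zone")
        if not zone_url:
            # No zone field: likely a regional disk, which would need --region instead.
            continue
        # The zone is reported as a URL; the last path segment is the zone name.
        zone = zone_url.rsplit("/", 1)[-1]
        commands.append([
            "gcloud", "compute", "disks", "delete", disk["name"],
            f"--zone={zone}", f"--project={project}", "--quiet",
        ])
    return commands


# Hypothetical disk entry, for illustration only.
example = [{
    "name": "example-pvc-disk",
    "zone": "https://www.googleapis.com/compute/v1/projects/tekton-prow-10/zones/us-central1-a",
}]
for cmd in delete_disk_commands("tekton-prow-10", example):
    print(" ".join(cmd))
```

Note that the delete command is scoped with `--zone` here. The janitor error above complains about `--global` being passed to `gcloud compute disks delete`, which that command does not accept.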

afrittoli added a commit to afrittoli/plumbing that referenced this issue Jan 20, 2020
Update the container image used by Boskos components in an attempt
to solve tektoncd#186.
tekton-robot pushed a commit that referenced this issue Jan 20, 2020
Update the container image used by Boskos components in an attempt
to solve #186.
@bobcatfish
Contributor Author

I've deleted the disks from projects 10 and 0!

@bobcatfish
Contributor Author

Looks like lots of folks using boskos ran into problems on Friday: kubernetes/test-infra#15951

@bobcatfish
Contributor Author

Okay, so I looked into it a bit more. When I looked today, after deleting the disks that @afrittoli noticed were not deleted (TODO for myself: give everyone more permissions!! or we can move to a model where maybe the CDF owns these clusters and everyone can be a full admin!!! ANYWAY), all of the clusters were "free".

I (finally) noticed that the Prow folks ran into these same errors on Friday resulting in kubernetes/test-infra#15951. It looks like the consensus is that gcloud itself had an outage (i.e. the API it communicates with).

I also updated to the latest boskos images that the Prow folks are using (and noticed #193) but we should be good to go now!

bobcatfish added a commit to bobcatfish/plumbing that referenced this issue Jan 21, 2020
Using the same boskos version the prow folks are using, e.g.,
https://github.com/kubernetes/test-infra/blob/b2471685eed6a7d063d7e1e19032282bb33679db/prow/cluster/boskos.yaml#L65
which they bumped to in the context of dealing with the same issue we
ran into on Friday (tektoncd#186)
tekton-robot pushed a commit that referenced this issue Jan 22, 2020
Using the same boskos version the prow folks are using, e.g.,
https://github.com/kubernetes/test-infra/blob/b2471685eed6a7d063d7e1e19032282bb33679db/prow/cluster/boskos.yaml#L65
which they bumped to in the context of dealing with the same issue we
ran into on Friday (#186)
4 participants