Intermittently getting "boskos failed to acquire project" error #1533

Closed

chizhg opened this issue Nov 8, 2019 · 7 comments

chizhg commented Nov 8, 2019

For the integration test Prow jobs, we intermittently get a "boskos failed to acquire GKE project" error:

2019/11/07 02:47:45 failed acquiring GKE cluster: 'failed acquiring boskos project: 'boskos failed to acquire GKE project: status 500 Internal Server Error, status code 500'

Expanding boskos pool to 100 (#1529) does not seem to have solved this problem.

Based on the discussion, @chaodaiG suspects a race condition for compute resources between the boskos pods and the test pods, since they live in the same node pool of the Prow cluster; separating them into different node pools should therefore help. Before working on that solution, we need to collect relevant metrics to justify it.
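
For context, the test jobs lease a project by calling boskos's acquire endpoint over HTTP; the 500 above is the server-side response to that call. Below is a minimal sketch of such a call, assuming boskos's documented POST /acquire API with type/state/dest/owner query parameters; the pool name "gke-project", the owner string, and the in-cluster boskos URL are illustrative assumptions, not the exact values the knative jobs use.

package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// acquireProject asks boskos to lease one free resource of the given type and
// move it to the "busy" state for the given owner.
func acquireProject(boskosURL, resourceType, owner string) error {
	q := url.Values{}
	q.Set("type", resourceType) // e.g. "gke-project" (illustrative pool name)
	q.Set("state", "free")
	q.Set("dest", "busy")
	q.Set("owner", owner)

	resp, err := http.Post(boskosURL+"/acquire?"+q.Encode(), "", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		// A 500 here is what surfaces in the Prow job log as
		// "boskos failed to acquire GKE project: status 500 Internal Server Error, status code 500".
		return fmt.Errorf("boskos failed to acquire GKE project: status %s, status code %d",
			resp.Status, resp.StatusCode)
	}
	// On success the response body contains the leased resource as JSON (omitted here).
	return nil
}

func main() {
	// The boskos service address below is an assumption for illustration only.
	if err := acquireProject("http://boskos.test-pods.svc.cluster.local", "gke-project", "my-prow-job"); err != nil {
		fmt.Println(err)
	}
}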

chaodaiG commented:

This is getting worse; it is happening pretty frequently now. See this example: https://testgrid.knative.dev/serving#istio-1.4-no-mesh&width=20&exclude-non-failed-tests=50&sort-by-failures=

We should prioritize investigating and fixing this.

/assign @yt3liu

chizhg commented Dec 11, 2019

The boskos image update did not fix the issue.

After looking into the boskos source code, I'm pretty sure it falls into https://github.com/kubernetes/test-infra/blob/0c5566a3f019399377910bed349c2c280f5539e8/boskos/ranch/ranch.go#L155
when the error happens. One step further down, it is
https://github.com/kubernetes/test-infra/blob/0c5566a3f019399377910bed349c2c280f5539e8/boskos/crds/crd_storage.go#L52.

The key error log is:

{"error":"Operation cannot be fulfilled on resources.boskos.k8s.io \"knative-boskos-03\": the object has been modified; please apply your changes to the latest version and try again","level":"error","msg":"No available resource","time":"2019-12-11T06:24:20Z"}

Let's look into it more deeply tomorrow.
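
For reference, the "object has been modified" message above is a standard Kubernetes optimistic-concurrency conflict on the update of the boskos resource object. Below is a minimal sketch of the usual client-go mitigation; the resource struct and the getLatest/update helpers are hypothetical stand-ins for the CRD storage calls in crd_storage.go, not the actual boskos code at that commit.

package boskossketch

import "k8s.io/client-go/util/retry"

// resource loosely mirrors a boskos resource; the real CRD fields may differ.
type resource struct {
	Name  string
	State string
	Owner string
}

// getLatest and update are hypothetical stand-ins for the CRD client calls;
// update is expected to return a Conflict (409) error when the cached
// resourceVersion is stale.
var (
	getLatest func(name string) (*resource, error)
	update    func(r *resource) error
)

// markBusy re-reads the latest object and re-applies the change whenever the
// update fails with the "object has been modified" conflict seen in the log above.
func markBusy(name, owner string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		r, err := getLatest(name)
		if err != nil {
			return err
		}
		r.State = "busy"
		r.Owner = owner
		return update(r)
	})
}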

chaodaiG commented:

I figured this out too; I'm looking into it for more info.

chaodaiG commented:

When I checked this morning, the Boskos container in the Prow cluster was still using the old image. There are two observations:

yt3liu commented Dec 11, 2019

In the Stackdriver logs, the first error started to happen on November 11, 2019.

First error in boskos container [Stackdriver log]: 2019-11-11 09:36:24.000 PST
First error from test run [Stackdriver log]: 2019-11-11 11:00:17.228 PST

Around that time, the boskos pool was increased to 100 on November 6 (Pull Request). Boskos wrote its data to a persistent volume before kubernetes/test-infra#13594, so the suspicion is that the persistent volume may not be able to handle 100 projects.

The next step is to switch boskos from a StatefulSet to a Deployment, as mentioned earlier, after confirming with the k8s team.

chaodaiG commented:

Updating Boskos to the latest version fixed the problem.

/close

knative-prow-robot commented:

@chaodaiG: Closing this issue.

In response to this:

Updating Boskos to the latest version fixed the problem.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
