Intermittently getting "boskos failed to acquire project" error #1533

Closed

chizhg opened this issue Nov 8, 2019 · 7 comments

chizhg commented Nov 8, 2019

For the integration test Prow jobs, we intermittently get a "boskos failed to acquire GKE project" error:

2019/11/07 02:47:45 failed acquiring GKE cluster: 'failed acquiring boskos project: 'boskos failed to acquire GKE project: status 500 Internal Server Error, status code 500'

Expanding boskos pool to 100 (#1529) does not seem to have solved this problem.

Based on the discussion, @chaodaiG suspects a race condition for compute resources between the boskos pods and the test pods, since they live in the same node pool of the Prow cluster; separating them into different node pools should therefore help. Before working on that solution, we need to collect relevant metrics to justify it.
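
For context, the test jobs lease a project by calling boskos's acquire endpoint over HTTP; the 500 above is the server-side response to that call. Below is a minimal sketch of such a call, assuming boskos's documented POST /acquire API with type/state/dest/owner query parameters; the pool name "gke-project", the owner string, and the in-cluster boskos URL are illustrative assumptions, not the exact values the knative jobs use.

package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// acquireProject asks boskos to lease one free resource of the given type and
// move it to the "busy" state for the given owner.
func acquireProject(boskosURL, resourceType, owner string) error {
	q := url.Values{}
	q.Set("type", resourceType) // e.g. "gke-project" (illustrative pool name)
	q.Set("state", "free")
	q.Set("dest", "busy")
	q.Set("owner", owner)

	resp, err := http.Post(boskosURL+"/acquire?"+q.Encode(), "", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		// A 500 here is what surfaces in the Prow job log as
		// "boskos failed to acquire GKE project: status 500 Internal Server Error, status code 500".
		return fmt.Errorf("boskos failed to acquire GKE project: status %s, status code %d",
			resp.Status, resp.StatusCode)
	}
	// On success the response body contains the leased resource as JSON (omitted here).
	return nil
}

func main() {
	// The boskos service address below is an assumption for illustration only.
	if err := acquireProject("http://boskos.test-pods.svc.cluster.local", "gke-project", "my-prow-job"); err != nil {
		fmt.Println(err)
	}
}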

chaodaiG commented:

This is getting worse; it is happening pretty frequently now. See this example: https://testgrid.knative.dev/serving#istio-1.4-no-mesh&width=20&exclude-non-failed-tests=50&sort-by-failures=

We should prioritize investigating and fixing this.

/assign @yt3liu

chizhg commented Dec 11, 2019

The boskos image update did not fix the issue.

After looking into the boskos source code, I'm pretty sure it falls into https://github.com/kubernetes/test-infra/blob/0c5566a3f019399377910bed349c2c280f5539e8/boskos/ranch/ranch.go#L155
when the error happens. One step further down, it is
https://github.com/kubernetes/test-infra/blob/0c5566a3f019399377910bed349c2c280f5539e8/boskos/crds/crd_storage.go#L52.

The key error log is:

{"error":"Operation cannot be fulfilled on resources.boskos.k8s.io \"knative-boskos-03\": the object has been modified; please apply your changes to the latest version and try again","level":"error","msg":"No available resource","time":"2019-12-11T06:24:20Z"}

Let's look into it more deeply tomorrow.
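
For reference, the "object has been modified" message above is a standard Kubernetes optimistic-concurrency conflict on the update of the boskos resource object. Below is a minimal sketch of the usual client-go mitigation; the resource struct and the getLatest/update helpers are hypothetical stand-ins for the CRD storage calls in crd_storage.go, not the actual boskos code at that commit.

package boskossketch

import "k8s.io/client-go/util/retry"

// resource loosely mirrors a boskos resource; the real CRD fields may differ.
type resource struct {
	Name  string
	State string
	Owner string
}

// getLatest and update are hypothetical stand-ins for the CRD client calls;
// update is expected to return a Conflict (409) error when the cached
// resourceVersion is stale.
var (
	getLatest func(name string) (*resource, error)
	update    func(r *resource) error
)

// markBusy re-reads the latest object and re-applies the change whenever the
// update fails with the "object has been modified" conflict seen in the log above.
func markBusy(name, owner string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		r, err := getLatest(name)
		if err != nil {
			return err
		}
		r.State = "busy"
		r.Owner = owner
		return update(r)
	})
}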

chaodaiG commented:

I figured this out too; I'm looking into it for more info.

chaodaiG commented:

When I checked this morning, the Boskos container in the Prow cluster was still using the old image. There are two observations:

yt3liu commented Dec 11, 2019

In the Stackdriver logs, the first error started to happen on November 11, 2019.

First error in boskos container [Stackdriver log]: 2019-11-11 09:36:24.000 PST
First error from test run [Stackdriver log]: 2019-11-11 11:00:17.228 PST

Around that time, the boskos pool was increased to 100 on November 6 (Pull Request). Boskos wrote its data to a persistent volume before kubernetes/test-infra#13594, so the suspicion is that the persistent volume may not be able to handle 100 projects.

The next step is to switch boskos from a StatefulSet to a Deployment, as mentioned earlier, after confirming with the k8s team.

chaodaiG commented:

Updating Boskos to the latest version fixed the problem.

/close

knative-prow-robot commented:

@chaodaiG: Closing this issue.

In response to this:

Updating Boskos to the latest version fixed the problem.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
