Intermittently getting "boskos failed to acquire project" error #1533
Comments
This is getting worse; it's happening pretty frequently now. See this example: https://testgrid.knative.dev/serving#istio-1.4-no-mesh&width=20&exclude-non-failed-tests=50&sort-by-failures= We should prioritize investigating and fixing this. /assign @yt3liu
The boskos image update did not fix the issue. After looking into the boskos source code, I'm pretty sure it falls into https://github.com/kubernetes/test-infra/blob/0c5566a3f019399377910bed349c2c280f5539e8/boskos/ranch/ranch.go#L155 The key error log is
Let's look into it more deeply tomorrow.
I figured this out too; looking for more info.
When I checked this morning, the Boskos container in the Prow cluster was still using the old image. There are 2 observations:
In the Stackdriver logs, the first error occurred on November 11, 2019. First error in boskos container [Stackdriver log]: 2019-11-11 09:36:24.000 PST. Shortly before that, on November 6, the boskos pool was increased to 100 (Pull Request). Before kubernetes/test-infra#13594, boskos wrote its data to a persistent volume, so the suspicion is that the persistent volume may not be able to handle 100 projects. The next step is to switch boskos from a StatefulSet to a Deployment, as mentioned earlier, after confirming with the k8s team.
Updating Boskos to the latest version fixed the problem. /close
@chaodaiG: Closing this issue.
For the integration test Prow jobs, we intermittently get a "boskos failed to acquire GKE project" error:
Expanding the boskos pool to 100 (#1529) does not seem to have solved this problem.
Based on the discussion, @chaodaiG suspects that boskos pods and test pods are competing for compute resources because they run in the same node pool of the Prow cluster, so separating them into different node pools should help solve this problem. Before working on this solution, we need to collect relevant metrics to justify it.
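One possible way to turn "intermittently" into a number that can be compared before and after a change is a per-day failure rate of acquire attempts. The following is only a minimal sketch: the record format (a timestamp plus a success flag), the function name `failure_rate_by_day`, and the sample values are all hypothetical and are not taken from boskos, Prow, or real job data.

```python
from collections import defaultdict
from datetime import datetime


def failure_rate_by_day(attempts):
    """Return {date: failed / total} for (timestamp, succeeded) pairs.

    `attempts` is a hypothetical record format assumed for this sketch.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for timestamp, succeeded in attempts:
        day = timestamp.date()
        totals[day] += 1
        if not succeeded:
            failures[day] += 1
    # Sort by day so the result is easy to read as a timeline.
    return {day: failures[day] / totals[day] for day in sorted(totals)}


# Made-up sample records, for illustration only (not real job data).
sample = [
    (datetime(2020, 1, 1, 9, 0), True),
    (datetime(2020, 1, 2, 9, 30), False),
    (datetime(2020, 1, 2, 10, 0), True),
]
rates = failure_rate_by_day(sample)
```

If the rate does not drop after moving boskos to its own node pool, that would count as evidence against the resource-competition suspicion.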