If pod exists for some reason, it's a terminal failure. #4741
Comments
Hey @vaikas - agree that the error message is confusing and should be fixed. If the pod already exists due to the bug you mentioned, then I agree that we should treat the error as non-terminal. But are there any other cases where we run into this error, and can we always assume that the already-existing pod was created by Tekton vs. something else? Would the switch to an informer mean that we hit the pod-exists condition more often due to the chance of a stale cache?
Great questions :) Since it seems like we use pre-baked names, I was kind of thinking that we might just get rid of the list and use .Get always. At least, that's how I read the comment here, and from my hazy recollection of what the Knative ChildName does :) Line 332 in e46259e
So, if that's indeed the case, then there's really no need for the list (I've seen the list approach used where the name of the pod is not deterministic and you have to look at the labels via listing). If, however, the pod name that the taskrun creates is not deterministic (I'm not sure whether retrying taskruns changes this?), then we'd still need the list.

I think the other case might be where, for some reason, there's a pod that was created by something other than us (I'd reckon it wouldn't have the same labels as we expect :) ). In that case it wouldn't get caught by the list; we'd try to create it and the create would fail. The informer change should not have any impact on this, except reducing the load on the API server, since we do not have to hit it to get pod information.

So, I think mayhaps the best path forward is to just add the 'case' statement for IsAlreadyExists and treat it as a terminal error like today, but with a clearer error message? Another thing I guess we could do is, after we get the pod (if there was IsAlreadyExists), check for the expected fields (like we do in the List call) to see if it's "our" pod, and only if it doesn't match what we expect, call it a terminal failure. Does that make sense?
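A rough sketch of that "check whether it's our pod" idea is below. This is hypothetical, not Tekton's actual reconciler code: the function name, the client/lister wiring, and the "tekton.dev/taskRun" label key are illustrative assumptions; only the apierrors.IsAlreadyExists and controller.NewPermanentError calls are real library APIs.

```go
package example

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	corev1listers "k8s.io/client-go/listers/core/v1"
	"knative.dev/pkg/controller"
)

// createOrReusePod sketches the proposal above: create the pod with its
// deterministic name, and on AlreadyExists decide between "this is our pod,
// carry on" and "someone else owns this name, fail permanently".
func createOrReusePod(ctx context.Context, kc kubernetes.Interface,
	podLister corev1listers.PodLister, taskRunName, namespace string,
	desired *corev1.Pod) (*corev1.Pod, error) {

	created, err := kc.CoreV1().Pods(namespace).Create(ctx, desired, metav1.CreateOptions{})
	if err == nil {
		return created, nil
	}
	if !apierrors.IsAlreadyExists(err) {
		return nil, err
	}

	// The name is pre-baked, so fetch the existing pod from the informer cache.
	existing, getErr := podLister.Pods(namespace).Get(desired.Name)
	if getErr != nil {
		// Possibly a stale cache; return a plain error so the reconciler retries.
		return nil, getErr
	}

	// Check the fields we would otherwise have matched in the List call.
	// The label key here is illustrative, not necessarily the one Tekton uses.
	if existing.Labels["tekton.dev/taskRun"] != taskRunName {
		return nil, controller.NewPermanentError(fmt.Errorf(
			"pod %q already exists but does not belong to TaskRun %q", existing.Name, taskRunName))
	}
	return existing, nil
}
```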
Also, as far as stale cache concerns go, we switched to using informers everywhere in Knative a long time ago and we have not seen any issues (just better performance, less load on the API server, and so forth) compared to hitting the API server for get/list operations.
Thanks for the explanation @vaikas! Good point about the pods created for retries - I'm actually not sure what pod name we use for retried task runs (@lbernick do you know?)
This SGTM!
This sounds even better :)
We append a "-retryN" suffix to the previous pod name, e.g. "my-taskrun-pod", "my-taskrun-pod-retry1".
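To make that naming scheme concrete, a tiny hypothetical helper (not the actual Tekton code) following the description above might look like:

```go
package example

import "fmt"

// nextRetryPodName illustrates the scheme described above: retry N's pod
// reuses the previous pod's name with a "-retryN" suffix appended, e.g.
// nextRetryPodName("my-taskrun-pod", 1) == "my-taskrun-pod-retry1".
func nextRetryPodName(previousPodName string, retry int) string {
	return fmt.Sprintf("%s-retry%d", previousPodName, retry)
}
```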
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing.
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing.
Rotten issues close after 30d of inactivity. /close Send feedback to tektoncd/plumbing.
@tekton-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Expected Behavior
Background: While debugging an issue that caused duplicate reconcile events (root cause here), I noticed that if a pod was deemed non-existent at the time of the list, then by the time we went to create it, it may have already been created, and the pod create failed as expected with an AlreadyExists error. However, I was a bit surprised that this led to the TaskRun failing permanently (and, at first, with a slightly confusing error about a missing or invalid task):
From here
So, I played around with this a bit: I made that error non-fatal, and upon getting the create error I also tried to fetch the pod from the informer cache (related to #4740). And things worked just fine (despite there being another bug, linked above, in Knative pkg).
Anyways, wanted to see if we might want to consider treating the pod-already-exists case as a transient failure instead of a permanent one.
Just wanted to see how folks feel about it.
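A minimal sketch of the transient-vs-permanent distinction being proposed, assuming the Knative-style reconciler framework Tekton uses (where a plain returned error requeues the key, while an error wrapped with controller.NewPermanentError stops further retries); the function name is hypothetical:

```go
package example

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"knative.dev/pkg/controller"
)

// classifyPodCreateError sketches the proposed change: treat AlreadyExists as
// transient (plain error, so the reconciler requeues and the next pass can pick
// the pod up from the informer cache) instead of failing the TaskRun permanently.
func classifyPodCreateError(err error) error {
	if err == nil {
		return nil
	}
	if apierrors.IsAlreadyExists(err) {
		// Proposed: transient; let the next reconcile find the existing pod.
		return err
	}
	// Other creation failures could still be wrapped as permanent if they are
	// genuinely unrecoverable.
	return controller.NewPermanentError(err)
}
```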
Just wanted to see how folks feel about it.
Actual Behavior
Steps to Reproduce the Problem
Additional Info
Kubernetes version:
Output of kubectl version:
Tekton Pipeline version:
Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'