
If pod exists for some reason, it's a terminal failure. #4741

Closed
vaikas opened this issue Apr 7, 2022 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@vaikas
Contributor

vaikas commented Apr 7, 2022

Expected Behavior

Background: While debugging an issue that caused duplicate reconcile events (root cause here), I noticed that if a pod was deemed non-existent at the time of the list, then when we went to create it later it may have already been created, and the pod creation failed as expected with an AlreadyExists error. However, I was a bit surprised that this led to the TaskRun failing permanently, and with a slightly confusing error at first (the part about a missing or invalid task):

  status:
    completionTime: "2022-04-04T16:22:28Z"
    conditions:
    - lastTransitionTime: "2022-04-04T16:22:28Z"
      message: 'failed to create task run pod "sbom-syft-2xvwm": pods "sbom-syft-2xvwm-pod"
        already exists. Maybe missing or invalid Task syft-sboms/sbom-syft'
      reason: CouldntGetTask
      status: "False"
      type: Succeeded

From here

So, I played around with this a bit: I made that error non-fatal, and upon getting the create error I also tried to fetch the pod from the informer cache (related to #4740). Things then worked just fine (despite there being another bug, linked to above, in the Knative pkg).
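
A rough sketch of what I mean (the function name, wiring, and package are placeholders for this discussion, not the actual Tekton code):

    // Sketch only: create the pod, and on AlreadyExists fall back to the
    // informer cache (lister) instead of failing the TaskRun permanently.
    package sketch

    import (
        "context"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        k8serrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        corev1listers "k8s.io/client-go/listers/core/v1"
    )

    func getOrCreatePod(ctx context.Context, kc kubernetes.Interface,
        lister corev1listers.PodLister, pod *corev1.Pod) (*corev1.Pod, error) {
        created, err := kc.CoreV1().Pods(pod.Namespace).Create(ctx, pod, metav1.CreateOptions{})
        if err == nil {
            return created, nil
        }
        if !k8serrors.IsAlreadyExists(err) {
            // Any other error stays transient, so the key gets requeued.
            return nil, fmt.Errorf("failed to create pod %q: %w", pod.Name, err)
        }
        // The pod was created earlier (e.g. by a duplicate reconcile that the
        // List missed); read it back from the informer cache and carry on.
        return lister.Pods(pod.Namespace).Get(pod.Name)
    }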

Anyway, I wanted to see whether we might want to consider treating the pod-creation failure as a transient failure instead of a permanent one.

Just wanted to see how folks feel about it.

Actual Behavior

Steps to Reproduce the Problem

Additional Info

  • Kubernetes version:

    Output of kubectl version:

    (paste your output here)
    
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

@vaikas vaikas added the kind/bug Categorizes issue or PR as related to a bug. label Apr 7, 2022
@vaikas vaikas mentioned this issue Apr 7, 2022
@dibyom
Member

dibyom commented Apr 8, 2022

Hey @vaikas - agree that the error message is confusing and should be fixed. If the pod already exists due to the bug you mentioned, then I agree that we should treat the error as non-terminal. But are there any other cases where we run into this error, and can we always assume that the already-existing pod was created by Tekton vs. something else? Would the switch to an informer mean that we hit the pod-already-exists condition more often due to the chance of a stale cache?

@vaikas
Contributor Author

vaikas commented Apr 8, 2022

Great questions :) Since it seems like we use prebaked names, I was kind of thinking that we might just get rid of the list and use .Get always. At least, that's how I read the comment here, and from my hazy recollection of what the Knative ChildName does :)

// Generate a unique name based on the build's name.

So, if that's indeed the case, then there's really no need for the list (I've seen that done where the name of the pod is not deterministic and you have to look at the labels via listing).
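
For reference, this is roughly how I understand the helper to behave, assuming the comment above refers to knative.dev/pkg/kmeta.ChildName:

    // kmeta.ChildName is deterministic: the same TaskRun name always maps to
    // the same pod name, and it only hashes/truncates when the result would
    // exceed the Kubernetes 63-character name limit.
    package main

    import (
        "fmt"

        "knative.dev/pkg/kmeta"
    )

    func main() {
        // The same name the TaskRun in the status above would get:
        fmt.Println(kmeta.ChildName("sbom-syft-2xvwm", "-pod")) // sbom-syft-2xvwm-pod
    }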

If, however, the pod name that the TaskRun creates is not deterministic, that wouldn't work (I'm not sure if retrying TaskRuns changes this naming?).

I think the other case might be where, for some reason, there's a pod that's created by something other than us (since I'd reckon it wouldn't have the same labels as we expect :) ); in that case it wouldn't get caught by the list, we'd try to create it, and it would fail.

Informer change should not have any impact on this, except reducing the load on the API server since we do not have to hit it to get pod information.

So, I think mayhaps the best path forward is to just add the 'case' statement for IsAlreadyExists and treat it as a terminal error as today, but with a clearer error message?

Another thing I guess we could do is after we get the pod (if there was IsAlreadyExists), check for the expected fields (like we do in the List call) to see if it's "our" pod, and only if it doesn't match what we expect, call it a terminal failure.
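
Something like this sketch for that second option; the "tekton.dev/taskRun" label key is just my assumption standing in for whatever fields the existing List call matches on:

    // Sketch: after an IsAlreadyExists error, decide whether the pod that is
    // already there was created for this TaskRun, and only make it terminal
    // when it wasn't.
    package sketch

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "knative.dev/pkg/controller"
    )

    func checkExistingPod(existing *corev1.Pod, taskRunName string) error {
        if existing.Labels["tekton.dev/taskRun"] != taskRunName {
            // Same name, but not our pod: only this case becomes a
            // permanent (terminal) failure.
            return controller.NewPermanentError(fmt.Errorf(
                "pod %q already exists but was not created for TaskRun %q",
                existing.Name, taskRunName))
        }
        // Otherwise adopt the existing pod and keep reconciling as usual.
        return nil
    }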

Does that make sense?

@vaikas
Contributor Author

vaikas commented Apr 8, 2022

Also, as far as stale-cache concerns go, we switched to using informers everywhere in Knative a long time ago and have not seen any issues (only better performance, less load on the API server, and so forth) compared to hitting the API server for get/list operations.

@dibyom
Member

dibyom commented Apr 12, 2022

Thanks for the explanation @vaikas ! Good point about the pods created for retries - I'm actually not sure what pod name we use for retried task runs (@lbernick do you know?)

So, I think mayhaps the best path forward is to just add the 'case' statement for IsAlreadyExists and treat it as a terminal error as today, but with a clearer error message?

This SGTM!

Another thing I guess we could do is after we get the pod (if there was IsAlreadyExists), check for the expected fields (like we do in the List call) to see if it's "our" pod, and only if it doesn't match what we expect, call it a terminal failure.

This sounds even better :)

@lbernick
Member

I'm actually not sure what pod name we use for retried task runs (@lbernick do you know?)

We append a "-retryN" suffix to the previous pod name, e.g. "my-taskrun-pod", "my-taskrun-pod-retry1"

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2022
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 11, 2022
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
