
If pod exists for some reason, it's a terminal failure. #4741

Closed
vaikas opened this issue Apr 7, 2022 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@vaikas
Contributor

vaikas commented Apr 7, 2022

Expected Behavior

Background: While debugging an issue that caused duplicate reconcile events (root cause here), I noticed that if a pod was deemed non-existent at the time of the list, then when we went to create it later it may have already been created, and the pod creation failed as expected with an AlreadyExists error. However, I was a bit surprised that this led to the TaskRun failing permanently, and with a slightly confusing error at first (the part about a missing or invalid task):

  status:
    completionTime: "2022-04-04T16:22:28Z"
    conditions:
    - lastTransitionTime: "2022-04-04T16:22:28Z"
      message: 'failed to create task run pod "sbom-syft-2xvwm": pods "sbom-syft-2xvwm-pod"
        already exists. Maybe missing or invalid Task syft-sboms/sbom-syft'
      reason: CouldntGetTask
      status: "False"
      type: Succeeded

From here

So, I played around with this a bit: I made that error non-fatal, and upon getting the create error I also tried to fetch the pod from the informer cache (related to #4740). Things then worked just fine (despite there being another bug, linked to above, in the Knative pkg).
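
A rough sketch of what I mean (the function name, wiring, and package are placeholders for this discussion, not the actual Tekton code):

    // Sketch only: create the pod, and on AlreadyExists fall back to the
    // informer cache (lister) instead of failing the TaskRun permanently.
    package sketch

    import (
        "context"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        k8serrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        corev1listers "k8s.io/client-go/listers/core/v1"
    )

    func getOrCreatePod(ctx context.Context, kc kubernetes.Interface,
        lister corev1listers.PodLister, pod *corev1.Pod) (*corev1.Pod, error) {
        created, err := kc.CoreV1().Pods(pod.Namespace).Create(ctx, pod, metav1.CreateOptions{})
        if err == nil {
            return created, nil
        }
        if !k8serrors.IsAlreadyExists(err) {
            // Any other error stays transient, so the key gets requeued.
            return nil, fmt.Errorf("failed to create pod %q: %w", pod.Name, err)
        }
        // The pod was created earlier (e.g. by a duplicate reconcile that the
        // List missed); read it back from the informer cache and carry on.
        return lister.Pods(pod.Namespace).Get(pod.Name)
    }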

Anyway, I wanted to see whether we might want to consider treating the pod-creation failure as a transient failure instead of a permanent one.

Just wanted to see how folks feel about it.

Actual Behavior

Steps to Reproduce the Problem

Additional Info

  • Kubernetes version:

    Output of kubectl version:

    (paste your output here)
    
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

@vaikas vaikas added the kind/bug Categorizes issue or PR as related to a bug. label Apr 7, 2022
@vaikas vaikas mentioned this issue Apr 7, 2022
@dibyom
Member

dibyom commented Apr 8, 2022

Hey @vaikas - agree that the error message is confusing and should be fixed. If the pod already exists due to the bug you mentioned, then I agree that we should treat the error as non-terminal. But are there any other cases where we run into this error, and can we always assume that the already-existing pod was created by Tekton vs. something else? Would the switch to an informer mean that we hit the pod-already-exists condition more often due to the chance of a stale cache?

@vaikas
Contributor Author

vaikas commented Apr 8, 2022

Great questions :) Since it seems like we use prebaked names, I was kind of thinking that we might just get rid of the list and use .Get always. At least, that's how I read the comment here, and from my hazy recollection of what the Knative ChildName does :)

// Generate a unique name based on the build's name.

So, if that's indeed the case, then there's really no need for the list (I've seen that done where the name of the pod is not deterministic and you have to look at the labels via listing).
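
For reference, this is roughly how I understand the helper to behave, assuming the comment above refers to knative.dev/pkg/kmeta.ChildName:

    // kmeta.ChildName is deterministic: the same TaskRun name always maps to
    // the same pod name, and it only hashes/truncates when the result would
    // exceed the Kubernetes 63-character name limit.
    package main

    import (
        "fmt"

        "knative.dev/pkg/kmeta"
    )

    func main() {
        // The same name the TaskRun in the status above would get:
        fmt.Println(kmeta.ChildName("sbom-syft-2xvwm", "-pod")) // sbom-syft-2xvwm-pod
    }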

If, however, the pod name that the TaskRun creates is not deterministic, that wouldn't work (I'm not sure if retrying TaskRuns changes this naming?).

I think the other case might be where, for some reason, there's a pod that's created by something other than us (since I'd reckon it wouldn't have the same labels as we expect :) ); in that case it wouldn't get caught by the list, we'd try to create it, and it would fail.

Informer change should not have any impact on this, except reducing the load on the API server since we do not have to hit it to get pod information.

So, I think mayhaps the best path forward is to just add the 'case' statement for IsAlreadyExists and treat it as a terminal error as today, but with a clearer error message?

Another thing I guess we could do is after we get the pod (if there was IsAlreadyExists), check for the expected fields (like we do in the List call) to see if it's "our" pod, and only if it doesn't match what we expect, call it a terminal failure.
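
Something like this sketch for that second option; the "tekton.dev/taskRun" label key is just my assumption standing in for whatever fields the existing List call matches on:

    // Sketch: after an IsAlreadyExists error, decide whether the pod that is
    // already there was created for this TaskRun, and only make it terminal
    // when it wasn't.
    package sketch

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "knative.dev/pkg/controller"
    )

    func checkExistingPod(existing *corev1.Pod, taskRunName string) error {
        if existing.Labels["tekton.dev/taskRun"] != taskRunName {
            // Same name, but not our pod: only this case becomes a
            // permanent (terminal) failure.
            return controller.NewPermanentError(fmt.Errorf(
                "pod %q already exists but was not created for TaskRun %q",
                existing.Name, taskRunName))
        }
        // Otherwise adopt the existing pod and keep reconciling as usual.
        return nil
    }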

Does that make sense?

@vaikas
Contributor Author

vaikas commented Apr 8, 2022

Also, as far as stale-cache concerns go, we switched to using informers everywhere in Knative a long time ago and have not seen any issues (only better performance, less load on the API server, and so forth) compared to hitting the API server for get/list operations.

@dibyom
Member

dibyom commented Apr 12, 2022

Thanks for the explanation @vaikas ! Good point about the pods created for retries - I'm actually not sure what pod name we use for retried task runs (@lbernick do you know?)

So, I think mayhaps the best path forward is to just add the 'case' statement for IsAlreadyExists and treat it as a terminal error as today, but with a clearer error message?

This SGTM!

Another thing I guess we could do is after we get the pod (if there was IsAlreadyExists), check for the expected fields (like we do in the List call) to see if it's "our" pod, and only if it doesn't match what we expect, call it a terminal failure.

This sounds even better :)

@lbernick
Member

I'm actually not sure what pod name we use for retried task runs (@lbernick do you know?)

We append a "-retryN" suffix to the previous pod name, e.g. "my-taskrun-pod", "my-taskrun-pod-retry1"

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2022
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 11, 2022
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
