TaskRun fails with recoverable mount error #6960
Comments
I can help with the fix...
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing.
/remove-lifecycle stale
My team has tried to recover from a TaskRun stuck in this state:

```yaml
status:
  conditions:
  - lastTransitionTime: "2024-03-22T18:09:56Z"
    message: Failed to create pod due to config error
    reason: CreateContainerConfigError
    status: "False"
    type: Succeeded
  startTime: "2024-03-22T18:09:40Z"
  steps:
  - container: step-check-step
    name: check-step
    waiting:
      message: secret "oci-store" not found
      reason: CreateContainerConfigError
```

In that waiting (but failed status) state, we tried to provide the correct configuration to pull the image, but the task never recovered. We had a pipeline tied to the task (it spawned the task), and it was in a terminated/failed/non-waiting/non-recoverable state. We also went the other way and waited for the pod to time out while waiting, but that did not recover anything either.

I wonder, @RafaeLeal: you mentioned that the TaskRun fails and the pod recovers, but too late. In that state, is the TaskRun already terminated, with a completionTime, or is it still waiting? I wonder whether your problem is the same as ours, or whether we need to open a separate issue.
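For reference on the completionTime question above: a TaskRun that has fully terminated carries a `completionTime` next to its failed condition, whereas one that is still waiting does not. A minimal sketch of the terminated shape (timestamps and reason are illustrative placeholders, not from a real run):

```yaml
# Illustrative sketch only: the shape of a TaskRun status after termination.
# Timestamps and the failure reason are placeholder values.
status:
  completionTime: "2024-03-22T18:19:56Z"   # present once the TaskRun has terminated
  conditions:
  - lastTransitionTime: "2024-03-22T18:19:56Z"
    message: Failed to create pod due to config error
    reason: CreateContainerConfigError
    status: "False"
    type: Succeeded
  startTime: "2024-03-22T18:09:40Z"
```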
There are a few other similar issues, some closed due to inactivity, but this issue (#6960) seems closest to what my team is seeing.
Expected Behavior
TaskRun's pods should be able to recover from transient mount errors
Actual Behavior
When such an error occurs, the pod enters the `CreateContainerConfigError` state and then the TaskRun fails. Often the pod recovers, but it's too late.
This behavior was introduced in #1907
Steps to Reproduce the Problem
Not sure exactly how to reproduce this, but we have a fairly big Tekton cluster and it happens quite often with a volume that uses AWS EFS.
What happens is that we notice a pod status like the one sketched below.
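A representative sketch of such a pod status, assuming a step container named `step-check-step` and the missing `oci-store` secret from the TaskRun status quoted in the comments above (all values illustrative):

```yaml
# Illustrative sketch only: a pod whose step container cannot start because a
# mounted secret is missing. Names and the message are example values.
status:
  phase: Pending
  containerStatuses:
  - name: step-check-step
    ready: false
    restartCount: 0
    state:
      waiting:
        reason: CreateContainerConfigError
        message: secret "oci-store" not found
```

The step `waiting` state in the TaskRun status quoted earlier mirrors this container state, which is presumably why the TaskRun is marked failed even though the kubelet keeps retrying and the pod can still recover.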
Additional Info
- Kubernetes version:
  - Output of `kubectl version`:
- Tekton Pipeline version:
  - Output of `tkn version` or `kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'`: