Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken "message" on last transition in case of failure #2029

Closed
afrittoli opened this issue Feb 10, 2020 · 14 comments
Closed

Broken "message" on last transition in case of failure #2029

afrittoli opened this issue Feb 10, 2020 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@afrittoli
Copy link
Member

Expected Behavior

The message shall point to the first step that failed and how to get its logs.

Actual Behavior

The message points to the last step that failed.
Additionally the message to get the logs includes an extra "new line" that breaks the copy/paste.

When a taskrun has output resources, steps are appended to process those. If a step during the task fails, all the appended resource steps are marked as failed e.g.

    state:
      terminated:
        containerID: docker://35d721aed3233f6217a63f97661afdd8d344b33718744ab374992c19c6cf8487
        exitCode: 1
        finishedAt: "2020-02-10T13:16:57Z"
        message: |
          {"level":"info","ts":1581340399.0176625,"logger":"fallback-logger","caller":"logging/config.go:69","msg":"Fetch GitHub commit ID from kodata failed: open /var/run/ko/HEAD: no such file or directory"}
          2020/02/10 13:16:57 Skipping step because a previous step failed
        reason: Error
        startedAt: "2020-02-10T13:13:19Z"
``

# Steps to Reproduce the Problem

1. https://dashboard.dogfooding.tekton.dev/#/namespaces/default/pipelineruns/triggers-release-nightly-sk54v

# Additional Info
One option here is to fix the logic to get the first failed stepped instead of the last - however I think the current logic relies on the fact that only one step may be failed, which - at least for now - seems reasonable. I wonder if we should leave these steps added by resources as "unknown" instead of marking them as "failed" when they are skipped.
@afrittoli
Copy link
Member Author

/kind bug
/priority urgent

@tekton-robot tekton-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 10, 2020
@vdemeester vdemeester added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Feb 10, 2020
@GregDritschler
Copy link
Contributor

I have been looking into this issue. There are two problems reported.

  1. The PipelineRun "for logs run..." message does not point to the first failing step.
  2. The message has an extra newline.

For the first problem, pkg/pod/status.go does sort the container status by finish time (State.Terminated.FinishedAt) to try to find the first failure. However this field has a resolution of seconds. In my experiments the steps that are skipped usually finish in the same second as the step that failed. This defeats the intention of the sort. By the way this problem can be observed without using output resources. Any user steps that follow the failed step get the skipped message too.

My initial thought to fix this would be to store a higher-precision finish time in the pod's termination message and copy that to the termination status, imitating what happens with the start time but with a higher-precision timestamp. One side effect is that the higher-precision timestamp would show up externally in the finishedAt field. It might be desirable to change startedAt have the same precision so the formats look alike.

I have not been able to reproduce the second problem. I see that the pod status has the "skipping" message you've shown above. However I don't see any effect on the PipelineRun's "for logs run" message, i.e. I don't see a newline in it. Can you show an example of what you are seeing?

@afrittoli
Copy link
Member Author

afrittoli commented Mar 13, 2020

@GregDritschler Thank you for looking into this!
This is the full spec of the taskRun (with status) where I originally saw the issue: https://gist.github.com/25a4fb15cb0e52a28a2c3f8dbbf67878

The problem there is that the step-image-digest-exporter-bhdb6 is marked as the cause of the failure of the taskrun, while in reality a step in the task failed, and step-image-digest-exporter-bhdb6 was never even executed.

        - lastTransitionTime: "2020-02-10T13:16:59Z"
          message: '"step-image-digest-exporter-bhdb6" exited with code 1 (image:
            "docker-pullable://gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/imagedigestexporter@sha256:1cf3f27f3ff7c73782d8a65853e8fc7f0d4aafc6443893e0150bdbe614a9169d");
            for logs run: kubectl -n default logs triggers-release-nightly-sk54v-publish-images-9fpzt-pod-8wcvh
            -c step-image-digest-exporter-bhdb6'
          reason: Failed
          status: "False"
          type: Succeeded

I've been trying to reproduce this against Tekton v0.10.1, and I've not been able to yet. I used a taskrun directly, while in the original case the taskrun was triggered as part of a pipeline, but I cannot image it would make a difference. This is the taskrun I used as a reproducer:

  apiVersion: tekton.dev/v1alpha1
  kind: TaskRun
  metadata:
    generateName: a-task-to-reproduce-tekton-pipelines-issue-2029-
  spec:
    taskSpec:
      outputs:
        resources:
        - name: pr
          type: pullRequest
        - name: ci
          type: image
        - name: notification
          type: cloudEvent
      steps:
      - name: this-step-does-something-it-sleeps-30-seconds
        image: busybox
        script: |
          #!/bin/sh
          set -ex
          echo "Wait for a bit"
          sleep 30
      - name: fail
        image: busybox
        script: |
          #!/bin/sh
          set -ex
          echo "This fails"
          false
      - name: never-runs
        image: busybox
        script: |
          #!/bin/sh
          set -ex
          echo "I never get to run"
    outputs:
      resources:
      - name: pr
        resourceSpec:
          type: pullRequest
          params:
          - name: url
            value: https://github.com/tektoncd/pipeline/pull/100
      - name: ci
        resourceSpec:
            type: image
            params:
            - name: url
              value: fake-registry/test/fake-image
      - name: notification
        resourceSpec:
            type: cloudEvent
            params:
            - name: targetURI
              value: http://el-pipeline-release-post-processing.default.svc.cluster.local:8080

What I still see though is the issue with the message to access log broken by a new line:

  conditions:
  - lastTransitionTime: "2020-03-13T11:20:41Z"
    message: '"step-fail" exited with code 1 (image: "docker.io/library/busybox@sha256:b26cd013274a657b86e706210ddd5cc1f82f50155791199d29b9e86e935ce135");
      for logs run: kubectl -n default logs a-task-to-reproduce-tekton-pipelines-issue-2029-lpdpk-pod-scpn9
      -c step-fail'
    reason: Failed
    status: "False"
    type: Succeeded

For this to happen the pod name generated by the taskrun must be long enough, so you can either have a long taskrun name or a taskrun within a pipelinerun, so that names are concatenated and become very long.

I will make another attempt to reproduce this, and if it fails I will close it.
I think we could split out the second issue in a dedicated one - it's not terrible, but it forces one to copy/paste twice to get the logs of a step, which I find rather inconvenient.

@afrittoli
Copy link
Member Author

Moved the second part of the issue to #2221

@afrittoli
Copy link
Member Author

afrittoli commented Mar 13, 2020

I tried to reproduce the issue using a pipeline but I could not reproduce it either.

@afrittoli
Copy link
Member Author

I have been looking into this issue. There are two problems reported.

1. The PipelineRun "for logs run..." message does not point to the first failing step.

2. The message has an extra newline.

For the first problem, pkg/pod/status.go does sort the container status by finish time (State.Terminated.FinishedAt) to try to find the first failure. However this field has a resolution of seconds. In my experiments the steps that are skipped usually finish in the same second as the step that failed. This defeats the intention of the sort. By the way this problem can be observed without using output resources. Any user steps that follow the failed step get the skipped message too.

This is an interesting observation, thank you.

I believe that any step after the failed one, whether or not the belong to pipeline resources, should be skipped and not failed, so there should not a "first failure", only "the failure".
In case multiple failures could happen, we should have a better mechanism in place to hint at all the steps that failed in the error message.
We could still add the ability to handle multiple failure there @vdemeester @bobcatfish thoughts?

@afrittoli afrittoli removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 13, 2020
@afrittoli
Copy link
Member Author

@GregDritschler thanks for your analysis on this, I realize now that I've not been able to reproduce this because of the timing in my pipeline. The issue does exists, and I believe it's worth solving, even if pipeline resources are not part of the beta.

The sorting by status introduced as a solution to #1905, however as you pointed out, the time resolution may not be enough to correctly identify the correct step that failed when relying on the time alone.

So I think the problem is that we do not encode enough information in the steps - at pod level - to distinguish a failed one from a skipped one. The pod for a skipped test looks like this:

                "containerID": "containerd://54195e79b3e02d67ce14cb21ed04513ce0694f7072da2f7d1fee6f0bf41793a7",
                "image": "sha256:fd5c69b7e36d96dcd25ed34c523062c5d0c47fc62bc84e49dde0146425492404",
                "imageID": "gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/pullrequest-init@sha256:3783254c379b286dd0987674160d3e19a95be1ccf0985788d6dcc0f159199095",
                "lastState": {},
                "name": "step-pr-source-pr-cxmlh",
                "ready": false,
                "restartCount": 0,
                "state": {
                    "terminated": {
                        "containerID": "containerd://54195e79b3e02d67ce14cb21ed04513ce0694f7072da2f7d1fee6f0bf41793a7",
                        "exitCode": 1,
                        "finishedAt": "2020-03-13T12:04:36Z",
                        "reason": "Error",
                        "startedAt": "2020-03-13T12:04:00Z"
                    }
                }

Which has no indication whatsoever about the step being skipped.
That information is encoded in the corresponding taskrun in the sense that only steps that were executed are tracked in the taskrun's status.steps.

I think the solution to this should involve finding the only step that actually failed. As of today it is not possible for more than one step to fail: even if multiple containers are marked as failed, only one is failed because of a failed step execution.

A couple of ways we could achieve that:

  • We could use the Reason field in the pod to provide more context, e.g. instead of Error we could say Step skipped.
  • We could change the method signature to pass the TaskRun status, and filter the pod containers to include only those that have a matching step in TaskRun.status.steps.

@GregDritschler
Copy link
Contributor

We could use the Reason field in the pod to provide more context, e.g. instead of Error we could say Step skipped.
How? The Reason field in the pod status is set by Kubernetes. How would you alter it?

We could change the method signature to pass the TaskRun status, and filter the pod containers to include only those that have a matching step in TaskRun.status.steps.
What method signature?

The Taskrun step status is built from the K8 container status. The only way I've seen to alter what's in there is via what was done to adjust the start time via the task results. It's ugly though.

@afrittoli
Copy link
Member Author

We could use the Reason field in the pod to provide more context, e.g. instead of Error we could say Step skipped.
How? The Reason field in the pod status is set by Kubernetes. How would you alter it?

Good point. I guess we could extend entrypoint to allow some logic that writes to the termination log, but that might be a bit too much to solve the issue.

We could change the method signature to pass the TaskRun status, and filter the pod containers to include only those that have a matching step in TaskRun.status.steps.
What method signature?

I mean the signature of [getFailureMessage](https://github.com/tektoncd/pipeline/blob/4e4884fb82de205569305e762b44741ebf2c77d2/pkg/pod/status.go#L278). If getFailureMessage was passed the taskRun spec by its called, it would be able to use it to determine which of the failed steps was the first one to be started and this the first one to fail (since they run sequentially). updateCompletedTaskRun only gets the task status, but it could be extended to include the spec.

The Taskrun step status is built from the K8 container status. The only way I've seen to alter what's in there is via what was done to adjust the start time via the task results. It's ugly though.

@GregDritschler
Copy link
Contributor

Well, status.go is already sorting the taskrun step status according to the task spec. The problem with this is that the internally-generated steps fall to the bottom of the list. So if the code were changed to use the taskrun step status to find the first failing step, it could still misidentify the first failing step when an internally-generated step failed.

@tekton-robot
Copy link
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot
Copy link
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot
Copy link
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Aug 13, 2020
@tekton-robot
Copy link
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
Projects
None yet
Development

No branches or pull requests

4 participants