Broken "message" on last transition in case of failure #2029
Comments
/kind bug
I have been looking into this issue. There are two problems reported.
For the first problem, pkg/pod/status.go does sort the container statuses by finish time (State.Terminated.FinishedAt) to try to find the first failure. However, this field has a resolution of seconds. In my experiments the steps that are skipped usually finish in the same second as the step that failed, which defeats the intention of the sort. By the way, this problem can be observed without using output resources: any user steps that follow the failed step get the skipped message too.

My initial thought for a fix would be to store a higher-precision finish time in the pod's termination message and copy that to the termination status, imitating what happens with the start time but with a higher-precision timestamp. One side effect is that the higher-precision timestamp would show up externally in the finishedAt field. It might be desirable to change startedAt to have the same precision so the formats look alike.

I have not been able to reproduce the second problem. I see that the pod status has the "skipping" message you've shown above. However, I don't see any effect on the PipelineRun's "for logs run" message, i.e. I don't see a newline in it. Can you show an example of what you are seeing?
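A rough sketch of that idea, assuming a termination-message entry analogous to the existing start-time one (the "FinishedAt" key and the helper names below are hypothetical, not the actual Tekton code):

```go
package pod

import (
	"encoding/json"
	"sort"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// terminationEntry mirrors the key/value shape Tekton writes to the container
// termination message (the same mechanism used to carry the start time).
type terminationEntry struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

// finishTime returns a nanosecond-precision finish time if the step wrote one
// to its termination message, falling back to the second-resolution
// State.Terminated.FinishedAt otherwise.
func finishTime(cs corev1.ContainerStatus) time.Time {
	if cs.State.Terminated == nil {
		return time.Time{}
	}
	var entries []terminationEntry
	if err := json.Unmarshal([]byte(cs.State.Terminated.Message), &entries); err == nil {
		for _, e := range entries {
			if e.Key == "FinishedAt" { // hypothetical key, by analogy with the start-time entry
				if t, err := time.Parse(time.RFC3339Nano, e.Value); err == nil {
					return t
				}
			}
		}
	}
	return cs.State.Terminated.FinishedAt.Time
}

// sortByFinishTime orders container statuses so the earliest-finishing one
// (the step that actually failed) comes first, even when several containers
// finish within the same second.
func sortByFinishTime(statuses []corev1.ContainerStatus) {
	sort.SliceStable(statuses, func(i, j int) bool {
		return finishTime(statuses[i]).Before(finishTime(statuses[j]))
	})
}
```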
@GregDritschler Thank you for looking into this! The problem there is that the
I've been trying to reproduce this against Tekton v0.10.1, and I've not been able to yet. I used a taskrun directly, while in the original case the taskrun was triggered as part of a pipeline, but I cannot imagine it would make a difference. This is the taskrun I used as a reproducer:
What I still see, though, is the issue with the message to access the logs being broken by a newline:
For this to happen the pod name generated by the taskrun must be long enough: either the taskrun name itself is long, or the taskrun runs within a pipelinerun, so that the names are concatenated and become very long. I will make another attempt to reproduce this, and if it fails I will close it.
Moved the second part of the issue to #2221
I tried to reproduce the issue using a pipeline but I could not reproduce it either.
This is an interesting observation, thank you. I believe that any step after the failed one, whether or not it belongs to pipeline resources, should be skipped and not failed, so there should not be a "first failure", only "the failure".
@GregDritschler thanks for your analysis on this. I realize now that I've not been able to reproduce this because of the timing in my pipeline. The issue does exist, and I believe it's worth solving, even if pipeline resources are not part of the beta. The sorting by status was introduced as a solution to #1905; however, as you pointed out, the time resolution may not be enough to identify the correct step that failed when relying on the time alone. So I think the problem is that we do not encode enough information in the steps - at pod level - to distinguish a failed one from a skipped one. The pod for a skipped step looks like this:
This has no indication whatsoever that the step was skipped. I think the solution should involve finding the only step that actually failed. As of today it is not possible for more than one step to fail: even if multiple containers are marked as failed, only one of them failed because of an actual step execution failure. A couple of ways we could achieve that:
- We could use the Reason field in the pod to provide more context, e.g. instead of Error we could say Step skipped.
- We could change the method signature to pass the TaskRun status, and filter the pod containers to include only those that have a matching step in TaskRun.status.steps.

The TaskRun step status is built from the K8s container status. The only way I've seen to alter what's in there is via what was done to adjust the start time via the task results. It's ugly though.
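A minimal sketch of the second option, assuming a hypothetical helper that receives the container names recorded in TaskRun.status.steps (this is not the actual method signature in pkg/pod/status.go; corev1 is k8s.io/api/core/v1 as above):

```go
// filterToStepContainers keeps only the pod container statuses whose names
// match a step recorded in the TaskRun status, so internally-appended resource
// containers cannot be mistaken for the failing step.
func filterToStepContainers(statuses []corev1.ContainerStatus, stepContainerNames []string) []corev1.ContainerStatus {
	known := make(map[string]bool, len(stepContainerNames))
	for _, name := range stepContainerNames {
		known[name] = true
	}
	filtered := make([]corev1.ContainerStatus, 0, len(statuses))
	for _, cs := range statuses {
		if known[cs.Name] {
			filtered = append(filtered, cs)
		}
	}
	return filtered
}
```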
Good point. I guess we could extend
I mean the signature of
Well, status.go is already sorting the taskrun step status according to the task spec. The problem with this is that the internally-generated steps fall to the bottom of the list. So if the code were changed to use the taskrun step status to find the first failing step, it could still misidentify the first failing step when an internally-generated step failed.
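To illustrate that ordering concern (a simplified sketch over plain step names, not the real sorting code in status.go): a step that is not declared in the task spec gets an index past the end, so it always sorts last and a failure in it would only be considered after all spec-declared steps.

```go
// specIndex returns the position of a step in the task spec; steps that are
// not declared there (internally-generated ones) get an index past the end,
// so they fall to the bottom of the sorted list.
func specIndex(specStepNames []string, step string) int {
	for i, name := range specStepNames {
		if name == step {
			return i
		}
	}
	return len(specStepNames)
}

// sortStepsBySpec reorders step names to match the task spec order.
func sortStepsBySpec(stepNames, specStepNames []string) {
	sort.SliceStable(stepNames, func(i, j int) bool {
		return specIndex(specStepNames, stepNames[i]) < specIndex(specStepNames, stepNames[j])
	})
}
```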
Issues go stale after 90d of inactivity. /lifecycle stale Send feedback to tektoncd/plumbing.
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing.
Rotten issues close after 30d of inactivity. /close Send feedback to tektoncd/plumbing.
@tekton-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Expected Behavior
The message shall point to the first step that failed and how to get its logs.
Actual Behavior
The message points to the last step that failed.
Additionally, the message to get the logs includes an extra newline that breaks copy/paste.
When a taskrun has output resources, steps are appended to process those. If a step during the task fails, all the appended resource steps are marked as failed, e.g.