Panic in controller when step fails before image digest exporter #2222
Conversation
Thanks for fixing this so quickly!!!
- Even tho this fixes a panic, if it's possible to get some tests in before merging I'd prefer that. Can be convinced to merge anyway tho.
- TerminationMessagePolicy is referenced in some builders still, i think we can remove that now?
pkg/pod/status.go
logger.Errorf("Could not parse json message %q because of %w", msg, err) | ||
break | ||
// step generated non-JSON termination message; ignore | ||
continue |
this change seems a bit inconsistent with the other error handling in this function, which will still be logging and breaking. 2 thoughts:
- are we sure we want to continue processing messages when this happens? altho we dont want to panic, maybe it's reasonable to assume that encountering a badly formatted message indicates the rest are bogus, or is it that we could have some steps with good termination messages, and some with bad?
- maybe logging here still makes sense? if you want to track down what happened later?
side note / scope creep: why are we creating a new logger in this function? if this is in the controller, can we pass in the (properly configured) logger instead? <-- probably material for a separate issue
this change seems a bit inconsistent with the other error handling in this function, which will still be logging and breaking

This is true, though the other breaks in this function, lines 136 & 148, are inside another nested for loop. That means when those lines are hit, the outer for loop is allowed to continue on processing the rest of the steps.
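A tiny stand-alone illustration of that control flow (just the shape of it, not the actual status code):

package main

import "fmt"

func main() {
	steps := []string{"step-0", "step-1", "step-2"}
	for _, step := range steps {
		for _, msg := range []string{"ok", "bad", "ok"} {
			if msg == "bad" {
				break // exits only this inner loop
			}
			fmt.Println(step, msg)
		}
		// the outer loop still advances to the next step after the break
	}
}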
are we sure we want to continue processing messages when this happens? altho we dont want to panic, maybe it's reasonable to assume that encountering a badly formatted message indicates the rest are bogus, or is it that we could have some steps with good termination messages, and some with bad?
My feeling is that this issue is evidence of some steps with good termination messages, and some with bad. There are other ways that a Step might inadvertently break the termination message, and I think we can afford to be a little more robust in that scenario. Off the top of my head, one example where this could happen again: a process spawned inside the container could race with the entrypoint to write the termination log and clobber it. That would be some awful user error but it isn't inconceivable, and I think Tekton should be able to handle it gracefully without abandoning the rest of the steps.
maybe logging here still makes sense? if you want to track down what happened later?
Yeah that's totally fair. I'll add this back in and look at threading the Logger through from the caller.
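For concreteness, the behaviour being described could be sketched like this; the result type and function name here are hypothetical stand-ins, not the actual pkg/pod code:

import (
	"encoding/json"

	"go.uber.org/zap"
)

// result is a stand-in for the key/value entries written into the
// termination message (assumed shape, for this sketch only).
type result struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

// parseStepResults logs and skips any step whose termination message
// isn't valid JSON instead of abandoning the remaining steps. The
// logger is assumed to be the one threaded in from the controller.
func parseStepResults(logger *zap.SugaredLogger, messages []string) [][]result {
	parsed := make([][]result, 0, len(messages))
	for _, msg := range messages {
		var rs []result
		if err := json.Unmarshal([]byte(msg), &rs); err != nil {
			logger.Errorf("Could not parse json message %q because of %v", msg, err)
			continue // bad message: skip this step, keep processing the rest
		}
		parsed = append(parsed, rs)
	}
	return parsed
}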
I've threaded the logger through from the controller and added the log line back in.
I've factored the inner for loop into its own function. I think this makes the control flow a little clearer. I've also added a test for the panic-inducing case.
pkg/pod/status.go
	}
	return false
})
return taskRunSteps
is this part of fixing the issue or is it a separate improvement? if the latter, maybe a separate PR makes sense?
So, the panic originates here. I have two thoughts related to this:
- we shouldn't really be leaving a sort func lying around that panics in the face of a length mismatch.
- this sorting code felt very complicated compared to its intended purpose. I spent a good amount of time just trying to puzzle out what it was trying to achieve when the out-of-bounds error occurred. I would guess that the number of steps in a task is always going to be relatively small, and I'm not convinced that the performance optimization this func was implementing was really worthwhile (even if the number of steps went into the hundreds or thousands, I don't think it would have changed the runtime dramatically).

I guess we could approach this a couple of different ways (that I can think of, happy to go a different way if there are more I don't mention):
- as done here, sort the statuses and push any that don't appear in taskRunSteps to the back of the list. I should probably also add a comment describing this behaviour.
- check if len(taskRunSteps) and len(taskSpecSteps) are equal. If not, return the list unsorted. Leave the stepStateSorter in place. Comment about the constraint.
- document the panic with a comment.
- leave alone for now and revisit in a separate PR.

I'm kinda happy with the change as it stands (though it needs a comment) but I also don't mind if you think we should tackle this in a separate PR or a totally different way. WDYT?

Edit: secret option number 5 - change this func's signature to return an error as well.
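For reference, the first option could look roughly like this (a sketch over plain names; the real code sorts step states, not strings):

import "sort"

// sortStepStatuses orders statuses to match the step order in the task
// spec, pushing any status whose name isn't in stepNames to the back of
// the list instead of panicking on a length mismatch.
func sortStepStatuses(stepNames, statuses []string) {
	rank := make(map[string]int, len(stepNames))
	for i, name := range stepNames {
		rank[name] = i
	}
	sort.SliceStable(statuses, func(i, j int) bool {
		ri, iKnown := rank[statuses[i]]
		rj, jKnown := rank[statuses[j]]
		if iKnown && jKnown {
			return ri < rj
		}
		return iKnown // known names sort before unknown ones
	})
}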
i've added a comment describing the new sort behaviour. still happy to take this change out or modify if preferred.
Nit (and note): sort.Slice uses reflection (whereas sort.Sort doesn't; it relies on the interface contract). This means it's gonna be slower. Might be premature optimization thinking that, though :stuck_out_tongue:
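For anyone curious about the nit, the difference looks like this: sort.Slice swaps elements through reflect.Swapper, while sort.Sort calls the sort.Interface methods directly:

import "sort"

type byName []string

func (b byName) Len() int           { return len(b) }
func (b byName) Swap(i, j int)      { b[i], b[j] = b[j], b[i] }
func (b byName) Less(i, j int) bool { return b[i] < b[j] }

func sortBothWays(names []string) {
	// Convenient, but the swap function is built with reflection.
	sort.Slice(names, func(i, j int) bool { return names[i] < names[j] })

	// No reflection: Len/Swap/Less are direct method calls.
	sort.Sort(byName(names))
}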
I've put the original sort func back. Don't want to hold up this PR if that change is controversial.
I wonder if #2029 was a different manifestation of this issue. In that case the overall Task failure status was set to that of the skipped image digest exporter step, which is incorrect since the failure was in a different step, and the digest exporter step should have only been skipped - see the full yaml.
/test pull-tekton-pipeline-integration-tests
/retest
/meow
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: vdemeester.
Thanks for the updates and explanations! Looks great! I had two minor pieces of feedback but also happy to merge as is!
/lgtm
// updateStatusStartTime searches for a result called "StartedAt" in the JSON-formatted termination message
// of a step and sets the State.Terminated.StartedAt field to this time if it's found. The "StartedAt" result
// is also removed from the list of results in the container status.
func updateStatusStartTime(s *corev1.ContainerStatus) error {
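Going by that docstring, the body presumably does something along these lines; this is a sketch, with the result shape and timestamp format assumed rather than taken from the PR:

import (
	"encoding/json"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// result is the assumed shape of an entry in the termination message.
type result struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

func updateStatusStartTimeSketch(s *corev1.ContainerStatus) error {
	term := s.State.Terminated
	if term == nil {
		return nil // nothing to do for a step that hasn't terminated
	}
	var results []result
	if err := json.Unmarshal([]byte(term.Message), &results); err != nil {
		return err
	}
	remaining := results[:0]
	for _, r := range results {
		if r.Key != "StartedAt" {
			remaining = append(remaining, r)
			continue
		}
		t, err := time.Parse(time.RFC3339, r.Value) // format assumed
		if err != nil {
			return err
		}
		s.State.Terminated.StartedAt = metav1.NewTime(t)
	}
	msg, err := json.Marshal(remaining)
	if err != nil {
		return err
	}
	s.State.Terminated.Message = string(msg)
	return nil
}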
i love that this is a separate function! i don't mind if you want to ignore this, but since it's separate i'd expect to see unit tests for it, and also i find that once we have a private/unexported function that 1) can have meaningful tests and 2) has good reason to have a docstring, it often makes sense to move it into its own package and make it public/exported
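A unit test for it could be as small as this (a sketch, reusing the assumed result shape from the sketch above):

import (
	"strings"
	"testing"

	corev1 "k8s.io/api/core/v1"
)

func TestUpdateStatusStartTime(t *testing.T) {
	s := &corev1.ContainerStatus{
		State: corev1.ContainerState{
			Terminated: &corev1.ContainerStateTerminated{
				Message: `[{"key":"StartedAt","value":"2020-03-13T12:03:26Z"}]`,
			},
		},
	}
	if err := updateStatusStartTimeSketch(s); err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if s.State.Terminated.StartedAt.IsZero() {
		t.Error("expected StartedAt to be set from the termination message")
	}
	if strings.Contains(s.State.Terminated.Message, "StartedAt") {
		t.Errorf("expected the StartedAt result to be removed, got %q", s.State.Terminated.Message)
	}
}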
@@ -600,6 +603,94 @@ func TestMakeTaskRunStatus(t *testing.T) {
	}},
	},
},
}, {
	desc: "non-json-termination-message-with-steps-afterwards-shouldnt-panic",
🎉
@@ -624,7 +715,8 @@ func TestMakeTaskRunStatus(t *testing.T) {
	},
}

-got := MakeTaskRunStatus(tr, pod, v1alpha1.TaskSpec{})
+logger, _ := logging.NewLogger("", "status")
thanks for threading the logger through! it's kind of a pain but i think it's the right way to do it :D
	break
}
if err := updateStatusStartTime(&s); err != nil {
	logger.Errorf("error setting the start time of step %q in taskrun %q: %w", s.Name, tr.Name, err)
ah okay so we're just logging an error but continuing onward - i could see including a comment here about why, since it might look like an oversight if just reading the code
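For example, the existing lines with a comment added (the wording is just a suggestion):

if err := updateStatusStartTime(&s); err != nil {
	// Log but don't fail the whole status update: a bad StartedAt
	// result shouldn't stop the remaining steps from being reported.
	logger.Errorf("error setting the start time of step %q in taskrun %q: %w", s.Name, tr.Name, err)
}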
Changes
Fixes #2220
The image digest exporter (part of the Image Output Resource) is configured with "terminationMessagePolicy": "FallbackToLogsOnError". When a previous step has failed in a Task, our entrypoint wrapping the exporter emits a log line like 2020/03/13 12:03:26 Skipping step because a previous step failed. Because the image digest exporter is set to FallbackToLogsOnError, Kubernetes slurps up this log line as the termination message.

That line gets read by the Tekton controller, which is looking for JSON in the termination message. It fails to parse and stops trying to read step statuses.

That in turn results in a mismatch between the length of the list of steps and the length of the list of step statuses. Finally we attempt to sort the list of step statuses alongside the list of steps. This method panics with an out-of-bounds error because it assumes the lengths of the two lists are the same, as the sketch below illustrates.
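A minimal stand-alone illustration of that failure mode (not the actual Tekton code):

package main

func main() {
	steps := []string{"build"}                    // steps whose statuses parsed
	statuses := []string{"build", "image-digest"} // all container statuses

	// A sorter that assumes both lists have the same length will index
	// past the end of the shorter one; this is the shape of the panic:
	for i := range statuses {
		_ = steps[i] // panics with index out of range when i == 1
	}
}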
So, this PR does the following things:
1. The image digest exporter has the FallbackToLogsOnError policy removed. I can't think of a reason that we need this anymore.
2. The Tekton controller no longer breaks out of the loop while it's parsing step statuses and instead simply ignores non-JSON termination messages.

Submitter Checklist
These are the criteria that every PR should meet, please check them off as you
review them:
See the contribution guide for more details.