-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Emit events for all TaskRun lifecycle events #2329
Conversation
4f12861
to
9972bff
Compare
One of the reason why certain tests fail here, is that we test on the number of actions executed in certain error cases, and this patch adds one "emit event" for the start case. This is roughly the ordering of events in the controller today.
Issues that I see:
We could move the start event to later on, but then it would not match with the start-time set in the taskrun anymore. Also later on it is a bit less obvious when a a
This is maybe a cosmetic thing, but I think we should move the timeout check somewhere else, maybe at the beginning of
This is not new, and not something that we should change in this PR, but the fact that we might associate events to failures on getting the task / validations adds more to the problem. The table in diagram format (perhaps less readable, but roughly same content): |
The more I think about this, the more I tend to think that we should emit events at the very start of the TaskRun i.e. when the controller first looks at it, and emit failures in case of validation failure - since that is something that task/pipeline editors could look for when testing their work. If there is no opposing opinion this week I'll move forward under this assumption, as I'd really like to move on with this series of PR swiftly now. |
9972bff
to
410252f
Compare
The following is the coverage report on pkg/.
|
cb7ae3b
to
6859fb8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good to me! Just some minor comments on code organization
Am I right in guessing that your overall goal is to try to align the k8s events we emit with potentially CloudEvents as well?
pkg/reconciler/taskrun/taskrun.go
Outdated
// Emit event | ||
afterCondition := tr.Status.GetCondition(apis.ConditionSucceeded) | ||
reconciler.EmitEvent(c.Recorder, beforeCondition, afterCondition, tr) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do these need to be functions on the reconciler? maybe we could make a new struct and initialize it with the function to emit events?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put them on the reconciler to give them access to anything that they may need from it, but indeed I was thinking of changing the implementation there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to remove these functions - at least until I actually need to use them.
Emitting events is already non-blocking - down in the implementation of generateEvent
.
So there was little value in adding functions here.
Besides, fetching the afterCondition in a goroutine triggered a data race in the unit tests, which is a good sign that I should not do that.
I like the table a lot but im not quite sure how to interpret the "requeue" and "notes" columns, maybe a bit more detail on those? i wonder if there is some way to isolate the logic that is used to emit events to make it a bit simpler, e.g. treat reconciling as having phases (i havent looked at your other PR @afrittoli which reorganizes reconciling so maybe you're doing this already!) but something like:
Maybe we could look at the status the Runs and use that to decide what events to emit, i.e. isolate the logic for this all in one place instead of throughout the reconcile Sounds like the logic around the start time is a bit more subtle than what above is capturing tho 🤔 |
Yeah, I've been trying to organize things a bit more in the other PR - I would welcome feedback there. Looking at how other resources behave I think I need some changes in the PR related to the start event:
For pipeline we will similarly have:
|
@bobcatfish going back on your point of organizing the reconcile in phases, I think it might be nice to have use different conditions during reconcile to identify them, and send events on transitions. The only downside might be more updates to the status. This is something I would prefer to do after this patch though. We should also add a doc page about different conditions we have today and their meaning. |
4aed1a8
to
fe5538d
Compare
The following is the coverage report on the affected files.
|
Nice increase in coverage, I only need to fix the data race now... |
fe5538d
to
8d24f0f
Compare
The following is the coverage report on the affected files.
|
/retest |
8d24f0f
to
a851d0c
Compare
The following is the coverage report on the affected files.
|
/test pull-tekton-pipeline-integration-tests |
/test pull-tekton-pipeline-build-tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/meow
pkg/reconciler/event.go
Outdated
// | ||
// Status "ConditionUnknown": | ||
// beforeCondition == nil, emit EventReasonStarted | ||
// beforeCondition != mil, emit afterCondition.Reason |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/mil/nil
😝
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: vdemeester The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of my feedback is super minor! Mostly comments and a couple typos.
The only piece of feedback that's at all substantial is that I'd really like to find away to avoid the sleep in the reconciler test. I don't mind merging this as is but if we could find an alternative to that I think it's worth the effort (or at least a detailed comment explaining why it's unavoidable - imo sleeps, especially in unit tests, should not be the norm)
successfully, including post-steps injected by Tekton. | ||
- `Failed`: this is triggered if the `TaskRun` is completed, but not successfully. | ||
Causes of failure may be: one the steps failed, the `TaskRun` was cancelled or | ||
the `TaskRun` timed out. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if it was skipped? (maybe to be added?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A TaskRun is something that may only happen when a Pipeline is involved, so I did not look into that on this patch, where I focuses on events generated by the TaskRun controller.
There are two kind of skips:
- skip because of a condition not met - which is easy, and we definitely can emit an event for that. I opened an issue which is related [feature] Conditions emit no events like other Tekton resources #2461
- skip because it was not reached in the DAG: we don't do anything today we tasks that were not reached in the DAG because of previous failures, but it should be possible to process them as a post-run step in the pipelinerun reconciler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah right, there's no taskrun at all, makes sense. those 2 skips are totally valid also. like you said, conditions aren't emitting events anyway, maybe we can revisit this then.
thanks!!
} | ||
case <-timer.C: | ||
if !ts.expectEvent { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm a bit confused by the cases that don't require an event, do you think it might make sense to have an event every time the condition changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only cases where we do not emit an event are:
- there was no change in condition (except for LastTransitionTime)
- afterCondition == nil
For the former I think we do not want an event, it may be we wrote the condition again and updated LastTransitionTime, but nothing really happened to the Condition, so no event.
For the latter, I could not think of a use case where afterCondition is nil, so I saw no point in adding that case now.
@@ -271,6 +272,29 @@ func getTaskRunController(t *testing.T, d test.Data) (test.Assets, func()) { | |||
}, cancel | |||
} | |||
|
|||
func checkEvents(fr *record.FakeRecorder, testName string, wantEvents []string) error { | |||
timer := time.NewTimer(1 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whats the timer for and why 1 second? maybe a comment to explain? (im guessing that since this is a "unit"-ish test that this timeout is extremely unlikely to occur)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the fake recorder runs in a go routine, so the timeout is just there to avoid waiting on the channel forever if fewer than expected events are received. I guess it could be reduced, but it will only be reached in case of test failure, so I don't think it matters really.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds reasonable! i do think a comment might make sense tho?
Yeah, good catch - this is actually one of the reasons I changed how we run event related unit tests - and finally I forgot to drop the sleep. My bet is that this is not needed anymore, so I will re-spin this without it and with fixes to typos. |
a851d0c
to
f9b4a17
Compare
The following is the coverage report on the affected files.
|
/test pull-tekton-pipeline-build-tests |
Emit events for additional TaskRun lifecyle events: - taskrun started - taskrun running - taskrun timeout Fix the logic in events.go to compare semantic equality as opposed to raw pointer equality. Fix broken EmitEvents unit tests and extend them to cover new functionality. Extend reconcile test to verify new events are sent. To do so, get the event recorder from the context when creating the controller - if avaialble. This allows using the fake recorder for testing instead of having to look for event related actions in the fake client go action list. Add documentation on events. Fixes tektoncd#2328 Work towards tektoncd#2082
f9b4a17
to
450cbc2
Compare
The following is the coverage report on the affected files.
|
I think there were 1 or 2 comments that might make sense to add but nothing worth blocking over! Thanks for this @afrittoli !! :D 🎉 /lgtm |
Changes
Emit events for additional TaskRun lifecyle events:
Fix the logic in events.go to compare semantic equality as
opposed to raw pointer equality.
Fix broken EmitEvents unit tests and extend them to cover new
functionality.
Extend reconcile test to verify new events are sent. To do so,
get the event recorder from the context when creating the
controller - if avaialble. This allows using the fake recorder
for testing instead of having to look for event related actions
in the fake client go action list.
Add documentation on events.
Fixes #2328
Work towards #2082
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you
review them:
See the contribution guide for more details.
Double check this list of stuff that's easy to miss:
cmd
dir, please updatethe release Task to build and release this image.
Reviewer Notes
If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.
Release Notes