wip: e2e: safer timeout test (less flakey 🙏) #691
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: vdemeester The full list of commands accepted by this bot can be found here. The pull request process is described here
@@ -284,7 +284,6 @@ func (c *Reconciler) reconcile(ctx context.Context, tr *v1alpha1.TaskRun) error
}
} else {
// Pod is not present, create pod.
go c.timeoutHandler.WaitTaskRun(tr)
Force-pushed from 52b552a to a0d1dd7.
/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-integration-tests
Damn, it failed once still… 😓 /test pull-tekton-pipeline-integration-tests
/test pull-tekton-pipeline-integration-tests
I tried to help out with this but I couldn't get to the bottom of it either! I noticed some things that might help though.
Ideas:
- Maybe we can encapsulate some of this functionality in functions to make it a bit clearer exactly what's happening (unless this is an anti-pattern!)
- Can we add some log statements when the timeouts are set up, etc.? That might help with debugging.
Some other things I noticed that look wrong:
- We use different "resync" periods in the PipelineRun and TaskRun controllers: TaskRun uses GetTrackerLease, PipelineRun uses 30 minutes.
- I don't think there is any reason to be enqueuing PipelineRuns on timeout, i.e. I think we could completely get rid of the PipelineRun timeout handling logic. The TaskRun handling logic works because `checkTimeout` is called at the TaskRun level on every reconcile, but as far as I can tell there is no PipelineRun equivalent. All timeouts are handled at the TaskRun level, so we are just re-enqueuing these forever (I tried this locally and the tests continued passing!).
Questions:
- Why do we use the status lock whenever we're touching the status of an object? The actual update doesn't happen when the values are changed, so even with locks we can still overwrite changes.
go c.timeoutHandler.WaitPipelineRun(pr)
started := make(chan struct{})
go c.timeoutHandler.WaitPipelineRun(pr, started)
<-started
I think that using `WaitPipelineRun` could be a little less error prone if we wrapped it, something like this:
In the reconciler:
if !pr.HasStarted() {
	c.timeoutHandler.StartPipelineRunWait(pr)
}
In the timeout handler:
// StartPipelineRunWait starts the wait goroutine and blocks until that
// goroutine signals (via started) that it is running.
func (t *TimeoutSet) StartPipelineRunWait(pr *v1alpha1.PipelineRun) {
	started := make(chan struct{})
	go t.waitPipelineRun(pr, started)
	<-started
}
(I wouldn't be surprised if this is an anti-pattern, hiding the fact that we're creating a goroutine! I'm a bit of a channel / goroutine noob.)
@bobcatfish why would it be an anti-pattern? `context.Context` and other packages do that (i.e. "hide" goroutine/channel usage as much as they can).
awesome!! that's perfect then :D :D :D
@@ -202,7 +209,7 @@ func (t *TimeoutSet) WaitTaskRun(tr *v1alpha1.TaskRun) {

// WaitPipelineRun function creates a blocking function for pipelinerun to wait for
// 1. Stop signal, 2. pipelinerun to complete or 3. pipelinerun to time out
Could this comment be updated to include a description of the new param? 😇 As a channel noob I couldn't tell right away whether `WaitPipelineRun` or the caller was writing to `started` 😇
@@ -214,6 +221,9 @@ func (t *TimeoutSet) WaitPipelineRun(pr *v1alpha1.PipelineRun) {
timeout -= runtime
finished := t.getOrCreateFinishedChan(pr)

timeAfter := time.After(timeout)
Another option here (instead of passing in `started`) could be to pass the `timeAfter` channel into this goroutine? e.g.
func (t *TimeoutSet) WaitPipelineRun(pr *v1alpha1.PipelineRun, timeAfter <-chan time.Time) {
	// Then we wouldn't need to write to `started`; we could use `timeAfter` directly in the `case`?
}
This would mean the caller would have to create `timeAfter`. Maybe we could wrap `WaitPipelineRun` and create `timeAfter` in the wrapping function.
interesting, that might look better even 👼
I want to see whether the TaskRun timeout test fails because this is a TaskRun (since the PipelineRun timeouts pass) or because of the time values we are using, so I'm creating another TaskRun timeout test that uses the same values as the PipelineRun test, but without the PipelineRun.
Force-pushed from a0d1dd7 to 5763cc3.
- Do not start two goroutines 😓. My bad: a rebase I messed up brought in an additional timeout goroutine 🙇.
- Use a channel (started) to make sure the timeout timer is started at the time we issue the `go …` call.

When using the `go` keyword to start a goroutine, there is no guarantee that the code inside the goroutine will start right away. The scheduler might (and most likely will) wait until the main goroutine (or the caller goroutine) blocks or sleeps before it starts running the new goroutine.

This means that, before this fix, we had no guarantee that the timer was started at the right time, especially if the controller is very busy.

Passing a channel and waiting for it to be closed just after the `go …` call blocks the caller until the goroutine's code has run far enough to close it, which, in our case, means the timeout timer has been started at the right time.

Signed-off-by: Vincent Demeester <[email protected]>
Force-pushed from 5763cc3 to dcefc7e.
@vdemeester: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Is this PR still relevant? I think it can probably be closed now, right?
Oh good point @dlorenc, it's not relevant anymore 😅
@vdemeester: Closing this PR. In response to this:
Changes
- Do not start two goroutines 😓. My bad: a rebase I messed up brought in an additional timeout goroutine 🙇 (23af2de).
- Use a channel (started) to make sure the timeout timer is started at the time we issue the `go …` call.

When using the `go` keyword to start a goroutine, there is no guarantee that the code inside the goroutine will start right away. The scheduler might (and most likely will) wait until the main goroutine (or the caller goroutine) blocks or sleeps before it starts running the new goroutine.

This means that, before this fix, we had no guarantee that the timer was started at the right time, especially if the controller is very busy.

Passing a channel and waiting for it to be closed just after the `go …` call blocks the caller until the goroutine's code has run far enough to close it, which, in our case, means the timeout timer has been started at the right time.
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you
review them:
[ ] Includes docs (if user facing)
See the contribution guide for more details.
Release Notes