
wip: e2e: safer timeout test (less flakey 🙏) #691

Closed
wants to merge 3 commits into from

Conversation

vdemeester
Member

@vdemeester vdemeester commented Mar 28, 2019

Changes

  • Do not start two goroutines 😓; my bad, a rebase mistake on my part
    brought in an additional timeout goroutine 🙇 (23af2de).

  • Use a channel (started) to make sure the timeout timer starts at the
    time we issue the go … call.

    When using the go keyword to start a goroutine, there is no
    guarantee the code inside the goroutine will start right away. The
    scheduler might (and most likely will) wait for the main
    goroutine (or the caller goroutine) to block or sleep before it
    starts running the new goroutine.

    This means that, before this fix, we had no guarantee the timer
    started at the right time, especially if the controller is very
    busy.

    Passing a channel and waiting for it to be closed just after the
    go … call forces the scheduler to park the caller and run the
    goroutine's code, which guarantees, in our case, that the timeout
    timer started at the right time (see the sketch below).
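
A minimal, self-contained sketch of the pattern (the name waitForRun and the
timing values are made up for this illustration, not actual Tekton code): the
goroutine arms its timer, then closes started, and the caller blocks on the
channel right after the go call, so the timer is guaranteed to be running
before the reconciler continues.

package main

import (
	"fmt"
	"time"
)

// waitForRun arms a timeout timer, signals the caller via started,
// then blocks until the timer fires.
func waitForRun(name string, timeout time.Duration, started chan struct{}) {
	timeAfter := time.After(timeout)
	close(started) // the timer is armed; let the caller continue
	<-timeAfter
	fmt.Printf("%s timed out after %s\n", name, timeout)
}

func main() {
	started := make(chan struct{})
	go waitForRun("example-run", 100*time.Millisecond, started)
	// Block until the goroutine has armed its timer, so the timeout clock
	// is guaranteed to be running before any further work happens.
	<-started
	fmt.Println("timeout timer armed, continuing")
	time.Sleep(200 * time.Millisecond) // give the timer a chance to fire
}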

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you
review them:

See the contribution guide
for more details.

Release Notes

Fix timeout handling: in the case of a busy controller, it was possible that the timeout timer wouldn't start at the right time; this should be fixed now.

@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 28, 2019
@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Mar 28, 2019
@tekton-robot tekton-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 28, 2019
@@ -284,7 +284,6 @@ func (c *Reconciler) reconcile(ctx context.Context, tr *v1alpha1.TaskRun) error
}
} else {
// Pod is not present, create pod.
go c.timeoutHandler.WaitTaskRun(tr)
Member Author


This line is duplicated with this one — see #674 for reference

@vdemeester vdemeester force-pushed the safer-timeout-test branch 2 times, most recently from 52b552a to a0d1dd7 Compare March 28, 2019 13:55
@vdemeester
Member Author

/test pull-tekton-pipeline-integration-tests

2 similar comments
@vdemeester
Member Author

/test pull-tekton-pipeline-integration-tests

@vdemeester
Member Author

/test pull-tekton-pipeline-integration-tests

@vdemeester
Member Author

Damn, it failed once still… 😓

/test pull-tekton-pipeline-integration-tests

@bobcatfish
Collaborator

/test pull-tekton-pipeline-integration-tests

Collaborator

@bobcatfish bobcatfish left a comment


I tried to help out with this but I couldn't get to the bottom of it either! I noticed some things that might help tho

Ideas:

  • Maybe we can encapsulate some of this functionality in functions to make it a bit clearer exactly what's happening (unless this is an anti-pattern!)
  • Can we add some log statements when the timeouts are set up, etc.? Might help with debugging

Some other things I noticed were wrong:

  • We use different "resync" periods in the PipelineRun and TaskRun controllers, TaskRun uses GetTrackerLease, PipelineRun uses 30 minutes
  • I don't think there is any reason to be en-queuing PipelineRuns on timeout, i.e. i think we could completely get rid of the PipelineRun timeout handling logic: the TaskRun handling logic works b/c checkTimeout is called at the TaskRun level on every reconcile, but as far as I can tell, there is no PipelineRun equivalent - all timeouts are handled at the TaskRun level, so we are just re-enqueuing these forever (and I did this locally and the tests continued passing!)

Questions:

  • Why do we use the status lock whenever we're touching the status of an object? The actual update doesn't happen when the values are changed, so even with locks we can still overwrite changes

- go c.timeoutHandler.WaitPipelineRun(pr)
+ started := make(chan struct{})
+ go c.timeoutHandler.WaitPipelineRun(pr, started)
+ <-started
Collaborator


I think that using WaitPipelineRun could be a little less error-prone if we wrapped it, something like this:

In the reconciler:

if !pr.HasStarted() {
    c.timeoutHandler.StartPipelineRunWait(pr)
}

In timeout handler:

func (t *TimeoutSet) StartPipelineRunWait(pr *v1alpha1.PipelineRun) {
	started := make(chan struct{})
	go t.waitPipelineRun(pr, started)
	<-started
}

Collaborator


(I wouldn't be surprised if this is an anti-pattern - hiding the fact that we're creating a goroutine! - I'm a bit of a channel/goroutine noob)

Member Author


@bobcatfish why would it be an anti-pattern? context.Context and other packages use that (aka "hide" goroutine/channel usage as much as they can)

Collaborator


awesome!! that's perfect then :D :D :D

@@ -202,7 +209,7 @@ func (t *TimeoutSet) WaitTaskRun(tr *v1alpha1.TaskRun) {

// WaitPipelineRun function creates a blocking function for pipelinerun to wait for
// 1. Stop signal, 2. pipelinerun to complete or 3. pipelinerun to time out
Collaborator


could this comment be updated to include a description of the new param? 😇 as a channel noob I couldn't tell right away whether WaitPipelineRun or the caller were writing to started 😇

@@ -214,6 +221,9 @@ func (t *TimeoutSet) WaitPipelineRun(pr *v1alpha1.PipelineRun) {
timeout -= runtime
finished := t.getOrCreateFinishedChan(pr)

timeAfter := time.After(timeout)
Collaborator


another option here (instead of passing in started) could be to pass the timeAfter into this goroutine? e.g.

func (t *TimeoutSet) WaitPipelineRun(pr *v1alpha1.PipelineRun, timeAfter <-chan time.Time) {
   // Then we wouldn't need to write to `started`; we could use `timeAfter` directly in the `case`.
}

This would mean the caller would have to create timeAfter; maybe we could wrap WaitPipelineRun and create timeAfter in the wrapping function (rough sketch below).
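
Roughly, a variant of the StartPipelineRunWait wrapper adapted to this idea
could look like the sketch below. This is only an illustration, assuming the
TimeoutSet receiver and getOrCreateFinishedChan from the existing code; the
timeout parameter and the body of the select are made up here, not existing
Tekton code. The point is that the caller arms the timer, so the timeout
clock starts even if the goroutine is scheduled late.

// Hypothetical wrapper: the timer is armed in the caller, then handed
// to the goroutine that waits on it.
func (t *TimeoutSet) StartPipelineRunWait(pr *v1alpha1.PipelineRun, timeout time.Duration) {
	timeAfter := time.After(timeout) // armed here, before the go call
	go t.waitPipelineRun(pr, timeAfter)
}

func (t *TimeoutSet) waitPipelineRun(pr *v1alpha1.PipelineRun, timeAfter <-chan time.Time) {
	finished := t.getOrCreateFinishedChan(pr)
	select {
	case <-finished:
		// the PipelineRun completed before the timeout
	case <-timeAfter:
		// the PipelineRun timed out; trigger the timeout callback / re-enqueue here
	}
}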

Member Author


interesting, that might look better even 👼

bobcatfish added a commit to bobcatfish/pipeline that referenced this pull request Apr 1, 2019
I want to see if the problem with the taskrun test timing out is b/c
this is a TaskRun (since the PipelineRun timeouts pass) or if it has
something to do with the times we are using, so I'm creating another
TaskRun timeout test that tries to use the same values the PipelineRun
test uses, but without the PipelineRun.
@vdemeester vdemeester force-pushed the safer-timeout-test branch from a0d1dd7 to 5763cc3 Compare April 5, 2019 13:36
- Do not start two goroutines 😓; my bad, a rebase mistake on my part
  brought in an additional timeout goroutine 🙇.
- Use a channel (started) to make sure the timeout timer starts at the
  time we issue the `go …` call.

  When using the `go` keyword to start a goroutine, there is no
  guarantee the code inside the goroutine will start right away. The
  scheduler might (and most likely will) wait for the main
  goroutine (or the caller goroutine) to block or sleep before it
  starts running the new goroutine.

  This means that, before this fix, we had no guarantee the timer
  started at the right time, especially if the controller is very
  busy.

  Passing a channel and waiting for it to be closed just after the
  `go …` call forces the scheduler to park the caller and run the
  goroutine's code, which guarantees, in our case, that the timeout
  timer started at the right time.

Signed-off-by: Vincent Demeester <[email protected]>
@tekton-robot
Collaborator

@vdemeester: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-tekton-pipeline-integration-tests dcefc7e link /test pull-tekton-pipeline-integration-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dlorenc
Contributor

dlorenc commented May 3, 2019

Is this PR still relevant? I think it can probably be closed now right?

@vdemeester
Member Author

oh good point @dlorenc, it's not relevant anymore 😅
/close

@tekton-robot
Collaborator

@vdemeester: Closing this PR.

In response to this:

oh good point @dlorenc, it's not relevant anymore 😅
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vdemeester vdemeester deleted the safer-timeout-test branch May 3, 2019 16:01