
wip: e2e: safer timeout test (less flakey 🙏) #691

Closed
wants to merge 3 commits into from

Conversation

vdemeester
Member

@vdemeester vdemeester commented Mar 28, 2019

Changes

  • Do not start two goroutines 😓; my bad, a rebase mistake on my part
    brought in an additional timeout goroutine 🙇 (23af2de).

  • Use a channel (started) to make sure the timeout timer starts at the
    time we issue the go … call.

    When using the go keyword to start a goroutine, there is no
    guarantee the code inside the goroutine will start right away. The
    scheduler might (and most likely will) wait for the main
    goroutine (or the caller goroutine) to block or sleep before it
    starts running the new goroutine.

    This means that, before this fix, we had no guarantee the timer
    started at the right time, especially if the controller is very
    busy.

    Passing a channel and waiting for it to be closed just after the
    go … call forces the scheduler to park the caller and run the
    goroutine's code, which guarantees, in our case, that the timeout
    timer started at the right time (see the sketch below).
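
A minimal, self-contained sketch of the pattern (the name waitForRun and the
timing values are made up for this illustration, not actual Tekton code): the
goroutine arms its timer, then closes started, and the caller blocks on the
channel right after the go call, so the timer is guaranteed to be running
before the reconciler continues.

package main

import (
	"fmt"
	"time"
)

// waitForRun arms a timeout timer, signals the caller via started,
// then blocks until the timer fires.
func waitForRun(name string, timeout time.Duration, started chan struct{}) {
	timeAfter := time.After(timeout)
	close(started) // the timer is armed; let the caller continue
	<-timeAfter
	fmt.Printf("%s timed out after %s\n", name, timeout)
}

func main() {
	started := make(chan struct{})
	go waitForRun("example-run", 100*time.Millisecond, started)
	// Block until the goroutine has armed its timer, so the timeout clock
	// is guaranteed to be running before any further work happens.
	<-started
	fmt.Println("timeout timer armed, continuing")
	time.Sleep(200 * time.Millisecond) // give the timer a chance to fire
}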

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you
review them:

See the contribution guide
for more details.

Release Notes

Fix timeout handling: in the case of a busy controller, it was possible that the timeout timer wouldn't start at the right time; this should be fixed now.

@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 28, 2019
@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Mar 28, 2019
@tekton-robot tekton-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 28, 2019
@@ -284,7 +284,6 @@ func (c *Reconciler) reconcile(ctx context.Context, tr *v1alpha1.TaskRun) error
}
} else {
// Pod is not present, create pod.
go c.timeoutHandler.WaitTaskRun(tr)
Member Author


This line is duplicated with this one — see #674 for reference

@vdemeester vdemeester force-pushed the safer-timeout-test branch 2 times, most recently from 52b552a to a0d1dd7 Compare March 28, 2019 13:55
@vdemeester
Member Author

/test pull-tekton-pipeline-integration-tests

2 similar comments
@vdemeester
Member Author

/test pull-tekton-pipeline-integration-tests

@vdemeester
Member Author

/test pull-tekton-pipeline-integration-tests

@vdemeester
Member Author

Damn, it failed once still… 😓

/test pull-tekton-pipeline-integration-tests

@bobcatfish
Collaborator

/test pull-tekton-pipeline-integration-tests

Collaborator

@bobcatfish bobcatfish left a comment


I tried to help out with this but I couldn't get to the bottom of it either! I noticed some things that might help tho

Ideas:

  • Maybe we can encapsulate some of this functionality in functions to make it a bit clearer exactly what's happening (unless this is an anti-pattern!)
  • Can we add some log statements when the timeouts are set up, etc.? Might help with debugging

Some other things I noticed were wrong:

  • We use different "resync" periods in the PipelineRun and TaskRun controllers, TaskRun uses GetTrackerLease, PipelineRun uses 30 minutes
  • I don't think there is any reason to be en-queuing PipelineRuns on timeout, i.e. i think we could completely get rid of the PipelineRun timeout handling logic: the TaskRun handling logic works b/c checkTimeout is called at the TaskRun level on every reconcile, but as far as I can tell, there is no PipelineRun equivalent - all timeouts are handled at the TaskRun level, so we are just re-enqueuing these forever (and I did this locally and the tests continued passing!)

Questions:

  • Why do we use the status lock whenever we're touching the status of an object? The actual update doesn't happen when the values are changed, so even with locks we can still overwrite changes

- go c.timeoutHandler.WaitPipelineRun(pr)
+ started := make(chan struct{})
+ go c.timeoutHandler.WaitPipelineRun(pr, started)
+ <-started
Collaborator


I think that using WaitPipelineRun could be a little less error-prone if we wrapped it, something like this:

In the reconciler:

if !pr.HasStarted() {
    c.timeoutHandler.StartPipelineRunWait(pr)
}

In timeout handler:

func (t *TimeoutSet) StartPipelineRunWait(pr *v1alpha1.PipelineRun) {
	started := make(chan struct{})
	go t.waitPipelineRun(pr, started)
	<-started
}

Collaborator


(I wouldn't be surprised if this is an anti-pattern - hiding the fact that we're creating a goroutine! - I'm a bit of a channel/goroutine noob)

Member Author


@bobcatfish why would it be an anti-pattern? context.Context and other packages use that (aka "hide" goroutine/channel usage as much as they can)

Collaborator


awesome!! that's perfect then :D :D :D

@@ -202,7 +209,7 @@ func (t *TimeoutSet) WaitTaskRun(tr *v1alpha1.TaskRun) {

// WaitPipelineRun function creates a blocking function for pipelinerun to wait for
// 1. Stop signal, 2. pipelinerun to complete or 3. pipelinerun to time out
Collaborator


could this comment be updated to include a description of the new param? 😇 as a channel noob I couldn't tell right away whether WaitPipelineRun or the caller were writing to started 😇

@@ -214,6 +221,9 @@ func (t *TimeoutSet) WaitPipelineRun(pr *v1alpha1.PipelineRun) {
timeout -= runtime
finished := t.getOrCreateFinishedChan(pr)

timeAfter := time.After(timeout)
Collaborator


another option here (instead of passing in started) could be to pass the timeAfter into this goroutine? e.g.

func (t *TimeoutSet) WaitPipelineRun(pr *v1alpha1.PipelineRun, timeAfter <-chan time.Time) {
   // Then we wouldn't need to write to `started`; we could use `timeAfter` directly in the `case`.
}

This would mean the caller would have to create timeAfter; maybe we could wrap WaitPipelineRun and create timeAfter in the wrapping function (rough sketch below).
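
Roughly, a variant of the StartPipelineRunWait wrapper adapted to this idea
could look like the sketch below. This is only an illustration, assuming the
TimeoutSet receiver and getOrCreateFinishedChan from the existing code; the
timeout parameter and the body of the select are made up here, not existing
Tekton code. The point is that the caller arms the timer, so the timeout
clock starts even if the goroutine is scheduled late.

// Hypothetical wrapper: the timer is armed in the caller, then handed
// to the goroutine that waits on it.
func (t *TimeoutSet) StartPipelineRunWait(pr *v1alpha1.PipelineRun, timeout time.Duration) {
	timeAfter := time.After(timeout) // armed here, before the go call
	go t.waitPipelineRun(pr, timeAfter)
}

func (t *TimeoutSet) waitPipelineRun(pr *v1alpha1.PipelineRun, timeAfter <-chan time.Time) {
	finished := t.getOrCreateFinishedChan(pr)
	select {
	case <-finished:
		// the PipelineRun completed before the timeout
	case <-timeAfter:
		// the PipelineRun timed out; trigger the timeout callback / re-enqueue here
	}
}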

Member Author


interesting, that might look better even 👼

bobcatfish added a commit to bobcatfish/pipeline that referenced this pull request Apr 1, 2019
I want to see if the problem with the taskrun test timing out is b/c
this is a TaskRun (since the PipelineRun timeouts pass) or if it has
something to do with the times we are using, so I'm creating another
TaskRun timeout test that tries to use the same values the PipelineRun
test uses, but without the PipelineRun.
@vdemeester vdemeester force-pushed the safer-timeout-test branch from a0d1dd7 to 5763cc3 Compare April 5, 2019 13:36
- Do not start two goroutines 😓; my bad, a rebase mistake on my part
  brought in an additional timeout goroutine 🙇.
- Use a channel (started) to make sure the timeout timer starts at the
  time we issue the `go …` call.

  When using the `go` keyword to start a goroutine, there is no
  guarantee the code inside the goroutine will start right away. The
  scheduler might (and most likely will) wait for the main
  goroutine (or the caller goroutine) to block or sleep before it
  starts running the new goroutine.

  This means that, before this fix, we had no guarantee the timer
  started at the right time, especially if the controller is very
  busy.

  Passing a channel and waiting for it to be closed just after the
  `go …` call forces the scheduler to park the caller and run the
  goroutine's code, which guarantees, in our case, that the timeout
  timer started at the right time.

Signed-off-by: Vincent Demeester <[email protected]>
@tekton-robot
Collaborator

@vdemeester: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-tekton-pipeline-integration-tests dcefc7e link /test pull-tekton-pipeline-integration-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dlorenc
Contributor

dlorenc commented May 3, 2019

Is this PR still relevant? I think it can probably be closed now right?

@vdemeester
Member Author

oh good point @dlorenc, it's not relevant anymore 😅
/close

@tekton-robot
Collaborator

@vdemeester: Closing this PR.

In response to this:

oh good point @dlorenc, it's not relevant anymore 😅
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vdemeester vdemeester deleted the safer-timeout-test branch May 3, 2019 16:01