Decouple TaskRun startTime from pod start time ⌚ #780
Conversation
I originally tried to completely remove the PipelineRun timeout handler, but it turns out that both controllers _do_ check for timeouts, they just do it differently. I hope to make them more similar in a follow-up PR, but in the meantime I refactored the timeout handler logic so that the handler itself is the same for both kinds of runs, and added the same `HasStarted` function for TaskRuns that PipelineRuns have.

Also:
- Added more logs (which was how I figured out the problem with tektoncd#731 in the first place!)
- Followed some linting guidelines, fixing typos and adding keyed fields
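For reference, the `HasStarted` check mentioned above amounts to something like the sketch below. This is a minimal stand-alone illustration with stand-in types (the real method lives on the `v1alpha1.TaskRun` type and its status uses `metav1.Time`), not the code from this PR:

```go
package main

import (
	"fmt"
	"time"
)

// Minimal stand-ins for the real v1alpha1 types; the actual TaskRunStatus in
// tektoncd/pipeline uses a *metav1.Time, but the logic is the same.
type TaskRunStatus struct {
	StartTime *time.Time
}

type TaskRun struct {
	Status TaskRunStatus
}

// HasStarted reports whether a start time has already been recorded for this
// TaskRun, independent of whether its pod exists yet.
func (tr *TaskRun) HasStarted() bool {
	return tr.Status.StartTime != nil && !tr.Status.StartTime.IsZero()
}

func main() {
	tr := &TaskRun{}
	fmt.Println(tr.HasStarted()) // false: no StartTime has been stamped yet

	now := time.Now()
	tr.Status.StartTime = &now
	fmt.Println(tr.HasStarted()) // true
}
```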
So there's good news and bad news. 👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there. 😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.
A Googler has manually verified that the CLAs look good. (Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)
Our TaskRun timeout end to end test has been intermittently failing against PRs. After adding a lot of (terrible) log messages, I found out that the reason TaskRuns weren't timing out was b/c the go routine checking for a timeout was considering them timed out several milliseconds before the reconcile loop would. So what would happen was:

1. The timeout handler decided the run timed out, and triggered a reconcile
2. The reconcile checked if the run timed out, decided the run had a few more milliseconds of execution time and let it go
3. And the TaskRun would just keep running

It looks like the root cause is that when the go routine starts, it uses a `StartTime` that has been set by `TaskRun.InitializeConditions`, but after that, the pod is started and the `TaskRun.Status` is updated to use _the pod's_ `StartTime` instead - which is what the Reconcile loop will use from that point forward, causing the slight drift.

This is fixed by no longer tying the start time of the TaskRun to the pod: the TaskRun will be considered started as soon as the reconciler starts to act on it, which is probably the functionality the user would expect anyway (e.g. if the pod was delayed in being started, this delay should be subtracted from the timeout, and in tektoncd#734 we are looking to be more tolerant of pods not immediately scheduling anyway).

Fixes tektoncd#731

Co-authored-by: Nader Ziada <[email protected]>
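Concretely, the fix amounts to the reconciler stamping the start time once and never overwriting it from the pod afterwards. Here is a rough, self-contained sketch of that idea (`ensureStartTime` and the struct fields are illustrative names, not the actual functions touched in this PR):

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative stand-ins for the real TaskRun status and pod metadata.
type TaskRun struct {
	StartTime *time.Time
}

type Pod struct {
	CreationTimestamp time.Time
}

// ensureStartTime records the moment the reconciler first acts on the TaskRun.
// It is deliberately never overwritten later, so the timeout goroutine and the
// reconcile loop always agree on the same start instant.
func ensureStartTime(tr *TaskRun, now time.Time) {
	if tr.StartTime == nil || tr.StartTime.IsZero() {
		tr.StartTime = &now
	}
}

func main() {
	tr := &TaskRun{}
	ensureStartTime(tr, time.Now())
	first := *tr.StartTime

	// Previously the status was reset to the pod's CreationTimestamp once the
	// pod existed, which is what introduced the few-millisecond drift above.
	pod := Pod{CreationTimestamp: time.Now().Add(50 * time.Millisecond)}
	ensureStartTime(tr, pod.CreationTimestamp) // no-op: a start time is already set

	fmt.Println(tr.StartTime.Equal(first)) // true: the start time never moved
}
```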
Since we are no longer trying to access the `Status` values in our go routines (we now pass the start time directly into the go routine), there is no longer any reason to provide locks around accesses to `Status`.
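As a rough illustration of why the locks can go away: once the start time is handed to the goroutine as a plain value, the remaining wait can be computed up front and `Status` never needs to be read again. A hypothetical sketch, not the handler's actual API:

```go
package main

import (
	"fmt"
	"time"
)

// waitForTimeout is a simplified stand-in for the timeout handler's goroutine:
// it receives the start time and timeout as plain values, so it never reads
// the run's Status and therefore needs no lock around it.
func waitForTimeout(startTime time.Time, timeout time.Duration, onTimeout func()) {
	remaining := timeout - time.Since(startTime)
	if remaining < 0 {
		remaining = 0
	}
	go func() {
		time.Sleep(remaining)
		onTimeout() // e.g. enqueue the run so the reconciler re-checks it
	}()
}

func main() {
	done := make(chan struct{})
	waitForTimeout(time.Now(), 100*time.Millisecond, func() {
		fmt.Println("run timed out, triggering a reconcile")
		close(done)
	})
	<-done
}
```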
(The ongoing CLA issue is because I co-authored a commit with @nader-ziada.)
```diff
@@ -336,7 +337,6 @@ func updateStatusFromPod(taskRun *v1alpha1.TaskRun, pod *corev1.Pod) {
 		})
 	}
 
-	taskRun.Status.StartTime = &pod.CreationTimestamp
```
note: this is the line that actually fixes the problem XD
/me like ! it simplifies the code even 💃
```diff
@@ -180,7 +180,6 @@ func TestPipelineRunTimeout(t *testing.T) {
 
 // TestTaskRunTimeout is an integration test that will verify a TaskRun can be timed out.
 func TestTaskRunTimeout(t *testing.T) {
-	t.Skip("Flakey test, tracked in https://github.com/tektoncd/pipeline/issues/731")
```
😻
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish, vdemeester

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Changes
Our TaskRun timeout end to end test has been intermittently failing against PRs. After adding a lot of (terrible) log messages, I found out that the reason TaskRuns weren't timing out was b/c the go routine checking for a timeout was considering them timed out several milliseconds before the reconcile loop would. So what would happen was:

1. The timeout handler decided the run timed out, and triggered a reconcile
2. The reconcile checked if the run timed out, decided the run had a few more milliseconds of execution time and let it go
3. And the TaskRun would just keep running

It looks like the root cause is that when the go routine starts, it uses a `StartTime` that has been set by `TaskRun.InitializeConditions`, but after that, the pod is started and the `TaskRun.Status` is updated to use _the pod's_ `StartTime` instead - which is what the Reconcile loop will use from that point forward, causing the slight drift.

This is fixed by no longer tying the start time of the TaskRun to the pod: the TaskRun will be considered started as soon as the reconciler starts to act on it, which is probably the functionality the user would expect anyway (e.g. if the pod was delayed in being started, this delay should be subtracted from the timeout, and in #734 we are looking to be more tolerant of pods not immediately scheduling anyway).
This PR also includes some refactoring to make the timeout logic more similar between PipelineRuns and TaskRuns; see the separate commits if you don't want to look at these changes all at once :)
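As a sketch of the direction that refactor points in (my own illustration, not code from the PR): once both kinds of run record their own start time, the "has this run timed out?" question reduces to one shared, kind-agnostic helper:

```go
package main

import (
	"fmt"
	"time"
)

// timedOut is a kind-agnostic timeout check that both the TaskRun and
// PipelineRun reconcilers could share: all it needs is a start time and a
// configured timeout. (In this sketch a zero timeout means "never time out".)
func timedOut(startTime time.Time, timeout time.Duration, now time.Time) bool {
	if timeout <= 0 {
		return false
	}
	return now.Sub(startTime) >= timeout
}

func main() {
	start := time.Now().Add(-2 * time.Minute)
	fmt.Println(timedOut(start, 1*time.Minute, time.Now())) // true
	fmt.Println(timedOut(start, 5*time.Minute, time.Now())) // false
}
```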
Verified by
I hacked in my change from #771 to run the timeout test 10 times and:
Submitter Checklist
These are the criteria that every PR should meet; please check them off as you review them:
See the contribution guide for more details.
Release Notes