Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decouple TaskRun startTime from pod start time ⌚ #780

Merged
merged 3 commits into from
Apr 23, 2019

Conversation

bobcatfish
Copy link
Collaborator

@bobcatfish bobcatfish commented Apr 22, 2019

Changes

Our TaskRun timeout end to end test has been intermittently failing
against PRs. After adding a lot of (terrible) log messages, I found out
that the reason TaskRuns weren't timing out was b/c the go routine
checking for a timeout was considering them timed out several
milliseconds before the reconcile loop would.

So what would happen was:

  1. The timeout handler decided the run timed out, and triggered a
    reconcile
  2. The reconcile checked if the run timed out, decided the run had a few
    more milliseconds of execution time and let it go
  3. And the TaskRun would just keep running

It looks like the root cause is that when the go routine starts,
it uses a StartTime that has been set by TaskRun.InitializeConditions,
but after that, the pod is started and the TaskRun.Status is updated to use the pod's
StartTime instead - which is what the Reconcile loop will use from
that point forward, causing the slight drift.

This is fixed by no longer tying the start time of the TaskRun to the
pod: the TaskRun will be considered started as soon as the reconciler
starts to act on it, which is probably the functionality the user would
expect anyway (e.g. if the pod was delayed in being started, this delay
should be subtracted from the timeout, and in #734 we are looking to be
more tolerant of pods not immediately scheduling anyway).

This PR also includes some refactoring to make the timeout logic more similar between PipelineRuns and TaskRuns, see the separate commits if you don't want to look at these changes all at once :)

Verified by

I hacked in my change from #771 to run the timeout test 10 times and:

  • Ran it a bunch of times locally - they all passed!
  • Ran the 10 test version once in this PR - it passed!

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

See the contribution guide
for more details.

Release Notes

- Fixes bug where sometimes TaskRun timeouts are dropped and the TaskRun may continue executing indefinitely
- TaskRun startTime is now determined by the start of reconciliation and is not tied to the creation time of the underlying Pod(s)

I originally tried to completely remove the PipelineRun timeout handler
- but it turns out that actually both controllers _do_ check for
timeouts, they just do it differently. I hope in a follow up PR to make
them more similar, but in the meantime, I refactored the timeout handler
logic such that the handler itself is the same for both kinds of runs,
and added the same `HasStarted` function for TaskRuns that PipelineRuns
have.

Also:
- Added more logs (which was how I figured out the problem with tektoncd#731 in
the first place!)
- Followed some linting guidelines, fixing typos, adding keyed fields
@googlebot
Copy link

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

ℹ️ Googlers: Go here for more info.

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 22, 2019
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 22, 2019
@bobcatfish bobcatfish added cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit and removed cla: no labels Apr 22, 2019
@googlebot
Copy link

A Googler has manually verified that the CLAs look good.

(Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)

ℹ️ Googlers: Go here for more info.

Our TaskRun timeout end to end test has been intermittently failing
against PRs. After adding a lot of (terrible) log messages, I found out
that the reason TaskRuns weren't timing out was b/c the go routine
checking for a timeout was considering them timed out several
milliseconds before the reconcile loop would.

So what would happen was:

1. The timeout handler decided the run timed out, and triggered a
   reconcile
2. The reconcile checked if the run timed out, decided the run had a few
   more milliseconds of execution time and let it go
3. And the TaskRun would just keep running

It looks like the root cause is that when the go routine starts,
it uses a `StartTime` that has been set by `TaskRun.InitializeConditions`,
but after that, the pod is started and the `TaskRun.Status` is updated to use _the pod's_
`StartTime` instead - which is what the Reconcile loop will use from
that point forward, causing the slight drift.

This is fixed by no longer tying the start time of the TaskRun to the
pod: the TaskRun will be considered started as soon as the reconciler
starts to act on it, which is probably the functionality the user would
expect anyway (e.g. if the pod was delayed in being started, this delay
should be subtracted from the timeout, and in tektoncd#734 we are looking to be
more tolerant of pods not immediately scheduling anyway).

Fixes tektoncd#731

Co-authored-by: Nader Ziada <[email protected]>
@googlebot
Copy link

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added cla: no and removed cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit labels Apr 22, 2019
@bobcatfish bobcatfish added cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit and removed cla: no labels Apr 22, 2019
@googlebot
Copy link

A Googler has manually verified that the CLAs look good.

(Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)

ℹ️ Googlers: Go here for more info.

@googlebot
Copy link

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added cla: no and removed cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit labels Apr 22, 2019
@bobcatfish bobcatfish added cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit and removed cla: no labels Apr 22, 2019
@googlebot
Copy link

A Googler has manually verified that the CLAs look good.

(Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)

ℹ️ Googlers: Go here for more info.

Since we are no longer trying to access the `Status` values in our
go-routines (we now pass the start time directly into the go routine)
, there is no longer any reason to provide locks around accesses to `Status`.
@googlebot
Copy link

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added cla: no and removed cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit labels Apr 23, 2019
@bobcatfish
Copy link
Collaborator Author

(the on-going CLA issue is because I co-authored a commit with @nader-ziada )

@bobcatfish bobcatfish added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Apr 23, 2019
@googlebot
Copy link

A Googler has manually verified that the CLAs look good.

(Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)

ℹ️ Googlers: Go here for more info.

@googlebot googlebot removed the cla: no label Apr 23, 2019
@bobcatfish bobcatfish marked this pull request as ready for review April 23, 2019 00:06
@@ -336,7 +337,6 @@ func updateStatusFromPod(taskRun *v1alpha1.TaskRun, pod *corev1.Pod) {
})
}

taskRun.Status.StartTime = &pod.CreationTimestamp
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: this is the line that actually fixes the problem XD

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/me like ! it simplifies the code even 💃

@@ -180,7 +180,6 @@ func TestPipelineRunTimeout(t *testing.T) {

// TestTaskRunTimeout is an integration test that will verify a TaskRun can be timed out.
func TestTaskRunTimeout(t *testing.T) {
t.Skip("Flakey test, tracked in https://github.com/tektoncd/pipeline/issues/731")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😻

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 23, 2019
@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [bobcatfish,vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot merged commit fab388a into tektoncd:master Apr 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants