Decouple TaskRun startTime from pod start time ⌚ #780
Conversation
I originally tried to completely remove the PipelineRun timeout handler, but it turns out that both controllers _do_ check for timeouts, they just do it differently. I hope to make them more similar in a follow-up PR, but in the meantime I refactored the timeout handler logic so that the handler itself is the same for both kinds of runs, and added the same `HasStarted` function for TaskRuns that PipelineRuns have.

Also:
- Added more logs (which was how I figured out the problem with tektoncd#731 in the first place!)
- Followed some linting guidelines, fixing typos and adding keyed fields
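For reference, the `HasStarted` check mentioned above amounts to something like the sketch below. This is a minimal stand-alone illustration with stand-in types (the real method lives on the `v1alpha1.TaskRun` type and its status uses `metav1.Time`), not the code from this PR:

```go
package main

import (
	"fmt"
	"time"
)

// Minimal stand-ins for the real v1alpha1 types; the actual TaskRunStatus in
// tektoncd/pipeline uses a *metav1.Time, but the logic is the same.
type TaskRunStatus struct {
	StartTime *time.Time
}

type TaskRun struct {
	Status TaskRunStatus
}

// HasStarted reports whether a start time has already been recorded for this
// TaskRun, independent of whether its pod exists yet.
func (tr *TaskRun) HasStarted() bool {
	return tr.Status.StartTime != nil && !tr.Status.StartTime.IsZero()
}

func main() {
	tr := &TaskRun{}
	fmt.Println(tr.HasStarted()) // false: no StartTime has been stamped yet

	now := time.Now()
	tr.Status.StartTime = &now
	fmt.Println(tr.HasStarted()) // true
}
```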
So there's good news and bad news. 👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there. 😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.
A Googler has manually verified that the CLAs look good. (Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)
Our TaskRun timeout end to end test has been intermittently failing against PRs. After adding a lot of (terrible) log messages, I found out that the reason TaskRuns weren't timing out was b/c the go routine checking for a timeout was considering them timed out several milliseconds before the reconcile loop would. So what would happen was:

1. The timeout handler decided the run timed out, and triggered a reconcile
2. The reconcile checked if the run timed out, decided the run had a few more milliseconds of execution time and let it go
3. And the TaskRun would just keep running

It looks like the root cause is that when the go routine starts, it uses a `StartTime` that has been set by `TaskRun.InitializeConditions`, but after that, the pod is started and the `TaskRun.Status` is updated to use _the pod's_ `StartTime` instead - which is what the Reconcile loop will use from that point forward, causing the slight drift.

This is fixed by no longer tying the start time of the TaskRun to the pod: the TaskRun will be considered started as soon as the reconciler starts to act on it, which is probably the functionality the user would expect anyway (e.g. if the pod was delayed in being started, this delay should be subtracted from the timeout, and in tektoncd#734 we are looking to be more tolerant of pods not immediately scheduling anyway).

Fixes tektoncd#731

Co-authored-by: Nader Ziada <[email protected]>
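Concretely, the fix amounts to the reconciler stamping the start time once and never overwriting it from the pod afterwards. Here is a rough, self-contained sketch of that idea (`ensureStartTime` and the struct fields are illustrative names, not the actual functions touched in this PR):

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative stand-ins for the real TaskRun status and pod metadata.
type TaskRun struct {
	StartTime *time.Time
}

type Pod struct {
	CreationTimestamp time.Time
}

// ensureStartTime records the moment the reconciler first acts on the TaskRun.
// It is deliberately never overwritten later, so the timeout goroutine and the
// reconcile loop always agree on the same start instant.
func ensureStartTime(tr *TaskRun, now time.Time) {
	if tr.StartTime == nil || tr.StartTime.IsZero() {
		tr.StartTime = &now
	}
}

func main() {
	tr := &TaskRun{}
	ensureStartTime(tr, time.Now())
	first := *tr.StartTime

	// Previously the status was reset to the pod's CreationTimestamp once the
	// pod existed, which is what introduced the few-millisecond drift above.
	pod := Pod{CreationTimestamp: time.Now().Add(50 * time.Millisecond)}
	ensureStartTime(tr, pod.CreationTimestamp) // no-op: a start time is already set

	fmt.Println(tr.StartTime.Equal(first)) // true: the start time never moved
}
```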
Since we are no longer trying to access the `Status` values in our go routines (we now pass the start time directly into the go routine), there is no longer any reason to provide locks around accesses to `Status`.
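As a rough illustration of why the locks can go away: once the start time is handed to the goroutine as a plain value, the remaining wait can be computed up front and `Status` never needs to be read again. A hypothetical sketch, not the handler's actual API:

```go
package main

import (
	"fmt"
	"time"
)

// waitForTimeout is a simplified stand-in for the timeout handler's goroutine:
// it receives the start time and timeout as plain values, so it never reads
// the run's Status and therefore needs no lock around it.
func waitForTimeout(startTime time.Time, timeout time.Duration, onTimeout func()) {
	remaining := timeout - time.Since(startTime)
	if remaining < 0 {
		remaining = 0
	}
	go func() {
		time.Sleep(remaining)
		onTimeout() // e.g. enqueue the run so the reconciler re-checks it
	}()
}

func main() {
	done := make(chan struct{})
	waitForTimeout(time.Now(), 100*time.Millisecond, func() {
		fmt.Println("run timed out, triggering a reconcile")
		close(done)
	})
	<-done
}
```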
(The ongoing CLA issue is because I co-authored a commit with @nader-ziada.)
```diff
@@ -336,7 +337,6 @@ func updateStatusFromPod(taskRun *v1alpha1.TaskRun, pod *corev1.Pod) {
 		})
 	}
 
-	taskRun.Status.StartTime = &pod.CreationTimestamp
```
note: this is the line that actually fixes the problem XD
/me like ! it simplifies the code even 💃
```diff
@@ -180,7 +180,6 @@ func TestPipelineRunTimeout(t *testing.T) {
 
 // TestTaskRunTimeout is an integration test that will verify a TaskRun can be timed out.
 func TestTaskRunTimeout(t *testing.T) {
-	t.Skip("Flakey test, tracked in https://github.com/tektoncd/pipeline/issues/731")
```
😻
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish, vdemeester

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Changes
Our TaskRun timeout end to end test has been intermittently failing against PRs. After adding a lot of (terrible) log messages, I found out that the reason TaskRuns weren't timing out was b/c the go routine checking for a timeout was considering them timed out several milliseconds before the reconcile loop would. So what would happen was:

1. The timeout handler decided the run timed out, and triggered a reconcile
2. The reconcile checked if the run timed out, decided the run had a few more milliseconds of execution time and let it go
3. And the TaskRun would just keep running

It looks like the root cause is that when the go routine starts, it uses a `StartTime` that has been set by `TaskRun.InitializeConditions`, but after that, the pod is started and the `TaskRun.Status` is updated to use _the pod's_ `StartTime` instead - which is what the Reconcile loop will use from that point forward, causing the slight drift.

This is fixed by no longer tying the start time of the TaskRun to the pod: the TaskRun will be considered started as soon as the reconciler starts to act on it, which is probably the functionality the user would expect anyway (e.g. if the pod was delayed in being started, this delay should be subtracted from the timeout, and in #734 we are looking to be more tolerant of pods not immediately scheduling anyway).
This PR also includes some refactoring to make the timeout logic more similar between PipelineRuns and TaskRuns; see the separate commits if you don't want to look at these changes all at once :)
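As a sketch of the direction that refactor points in (my own illustration, not code from the PR): once both kinds of run record their own start time, the "has this run timed out?" question reduces to one shared, kind-agnostic helper:

```go
package main

import (
	"fmt"
	"time"
)

// timedOut is a kind-agnostic timeout check that both the TaskRun and
// PipelineRun reconcilers could share: all it needs is a start time and a
// configured timeout. (In this sketch a zero timeout means "never time out".)
func timedOut(startTime time.Time, timeout time.Duration, now time.Time) bool {
	if timeout <= 0 {
		return false
	}
	return now.Sub(startTime) >= timeout
}

func main() {
	start := time.Now().Add(-2 * time.Minute)
	fmt.Println(timedOut(start, 1*time.Minute, time.Now())) // true
	fmt.Println(timedOut(start, 5*time.Minute, time.Now())) // false
}
```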
Verified by
I hacked in my change from #771 to run the timeout test 10 times and:
Submitter Checklist
These are the criteria that every PR should meet; please check them off as you review them:
See the contribution guide for more details.
Release Notes