The timeoutHandler is only instructed to wait when it creates pods #2905

Closed
mattmoor opened this issue Jul 6, 2020 · 20 comments · Fixed by #3500
Labels
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.


@mattmoor
Member

mattmoor commented Jul 6, 2020

/kind bug

Consider this logic in taskrun.go:

		pod, err = c.createPod(ctx, tr, rtr)
		if err != nil {
			newErr := c.handlePodCreationError(ctx, tr, err)
			logger.Error("Failed to create task run pod for task %q: %v", tr.Name, newErr)
			return newErr
		}
		go c.timeoutHandler.WaitTaskRun(tr, tr.Status.StartTime)

If, after creating the Pod, the controller process is restarted (e.g. to roll out a new controller binary), then the timeout handler won't properly time out task runs that straddle this event, because the controller's resync won't re-trigger the goroutine.

The pattern we have used successfully in Knative is to have the library support idempotent invocation, so that WaitTaskRun can be called on every resync (as appropriate) and simply elide redundant invocations. This is something that would shake out pretty quickly with chaos testing (knative/pkg#1333 👀).
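
To make that concrete, here's a minimal sketch of what such an idempotent waiter could look like; the Tracker type and its method names are illustrative, not Tekton's actual timeoutHandler API:

    package timeout

    import (
        "sync"
        "time"
    )

    // Tracker is a sketch of an idempotent timeout waiter. Callers may
    // invoke Wait for the same key on every resync; only the first call
    // per key arms a timer, and redundant invocations are elided.
    type Tracker struct {
        mu      sync.Mutex
        waiting map[string]struct{}
    }

    func NewTracker() *Tracker {
        return &Tracker{waiting: map[string]struct{}{}}
    }

    // Wait arms a timer that fires onTimeout(key) at deadline, unless a
    // timer for key is already armed, in which case it does nothing.
    func (t *Tracker) Wait(key string, deadline time.Time, onTimeout func(string)) {
        t.mu.Lock()
        if _, armed := t.waiting[key]; armed {
            t.mu.Unlock()
            return // redundant invocation: elide it
        }
        t.waiting[key] = struct{}{}
        t.mu.Unlock()

        time.AfterFunc(time.Until(deadline), func() {
            t.mu.Lock()
            delete(t.waiting, key)
            t.mu.Unlock()
            onTimeout(key)
        })
    }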

@mattmoor mattmoor added the kind/bug Categorizes issue or PR as related to a bug. label Jul 6, 2020
@vdemeester vdemeester added this to the Pipelines v0.15 milestone Jul 7, 2020
@bobcatfish bobcatfish self-assigned this Jul 13, 2020
@bobcatfish
Collaborator

I was very excited to fix this bug but I'm actually no longer sure it's a bug:

When the controllers start, they call CheckTimeouts (pipelinerun, taskrun; they probably don't BOTH need to call it, but oh well XD), which looks for taskruns and pipelineruns and either starts the waiting goroutine or times them out as needed.

Looks like this was originally addressed in #456.

That being said, having WaitTaskRun be idempotent makes a lot of sense, and it's definitely not obvious that this is happening, so I think there's some improvement to make here, but I don't think it's a bug.

@bobcatfish bobcatfish added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 24, 2020
@bobcatfish bobcatfish removed this from the Pipelines v0.15 milestone Jul 24, 2020
@bobcatfish
Collaborator

Thinking about this a bit more, even making the call idempotent won't handle the issue being described:

If, after creating the Pod, the controller process is restarted (e.g. to roll out a new controller binary), then the timeout handler won't properly time out task runs that straddle this event, because the controller's resync won't re-trigger the goroutine.

The call to WaitTaskRun wouldn't happen unless the reconcile loop occurs: will a newly started controller attempt to reconcile all instances of the types it is interested in, or will it only be invoked when a change occurs? Do you know, @mattmoor?

If it's only when an instance changes, we'll still need to query for them all and check them on startup. If the reconciler tries to reconcile all the instances it is interested in on startup, then the solution you describe would work.

bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Jul 24, 2020
While investigating tektoncd#2905, I struggled to understand how the timeout
handling works, especially with TimeoutSet having very few comments,
so I've added some. I didn't add anything for backoffs yet because I'm
hoping we can separate that into a separate structure, since it has a
very specific purpose that doesn't generalize to all timeouts.

Also changed the name "finished" to consistently use "done" so the
reader doesn't have to wonder about the difference between "finished"
and "done" (there isn't one).
@mattmoor
Member Author

will a newly started controller attempt to reconcile all instances of the types it is interested in, or will it only be invoked when a change occurs

Yes, we also do this when a controller becomes the leader of a particular bucket of the key space.

@bobcatfish
Collaborator

Yes, we also do this when a controller becomes the leader of a particular bucket of the key space.

Oh great!! @mattmoor, are you able to point me toward the code where that happens?

@mattmoor
Member Author

When informers are populating their cache they call OnAdd for each key as it is being read the first time, so your informer event hooks will queue stuff in the workqueue and the controller will process it.
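
As a rough illustration of that mechanism using client-go directly (Tekton wires this up through knative/pkg, so this sketch is not the actual Tekton code): during the informer's initial cache sync, every pre-existing object is replayed through AddFunc, which is why a freshly (re)started controller reconciles the whole world.

    package controller

    import (
        "k8s.io/client-go/tools/cache"
        "k8s.io/client-go/util/workqueue"
    )

    // wireInformer enqueues a key for every informer event. On startup,
    // the informer's initial LIST delivers each existing object via
    // AddFunc, so every key lands in the workqueue and gets reconciled
    // even though nothing about the object changed.
    func wireInformer(informer cache.SharedIndexInformer, queue workqueue.RateLimitingInterface) {
        enqueue := func(obj interface{}) {
            if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
                queue.Add(key)
            }
        }
        informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            AddFunc:    enqueue,
            UpdateFunc: func(oldObj, newObj interface{}) { enqueue(newObj) },
            DeleteFunc: enqueue,
        })
    }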

tekton-robot pushed a commit that referenced this issue Jul 25, 2020
@bobcatfish
Collaborator

When informers are populating their cache they call OnAdd for each key as it is being read the first time, so your informer event hooks will queue stuff in the workqueue and the controller will process it.

Do you have a link to the code where this happens? (Or some docs?) I can go looking for it if not. Basically, if we're going to remove our attempt to manually reconcile all the timeouts, which seems totally unnecessary now, I want to be able to point at the justification for why we don't need it anymore.

bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Jul 29, 2020
The timeout handler was trying to do two different jobs:
- Track timeouts for executing TaskRuns and PipelineRuns; when they
  time out, they should be re-reconciled, at which point they would
  fail b/c they took too long
- Track backoffs when TaskRun pod creation fails due to resource quota
  issues. In this case we want to retry pod creation (by re-reconciling)
  after a backoff in case the resource issue has been resolved.

The logic for these 2 jobs was combined because they both share at their
heart very similar logic: the logic for creating goroutines that wait
until a period of time has passed.

Now these two jobs are done by 2 different structs, which both use the
new Timer struct and logic, so they can share this logic without having
to mix the code on top.

Future things we can do to make this even better:
- Add more test coverage for the new structs
- The "Handler" struct (which may have too generic a name at this point)
  probably doesn't need to have separate logic for pipelineruns +
  taskruns, since each of these controllers will instantiate it
  separately
- As discussed in tektoncd#2905, knative/pkg's controller logic will attempt
  to re-reconcile all the Runs on startup, so we probably don't need
  to be explicitly doing this as well
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Jul 30, 2020
@afrittoli afrittoli added this to the Pipelines v0.16 milestone Aug 10, 2020
@bobcatfish
Collaborator

I think @imjasonh's suggestion is ultimately the best way to go here: #3078 (review)

I think there's a chance we can simplify a lot of this by using impl.EnqueueAfter to enqueue a reconciliation in the future when the timeout should be elapsed

I wish I could remember why we didn't do it originally. I'm guessing that either we didn't realize we could (most likely?) or we were avoiding an extra reconcile (doesn't seem worth the complexity).

We added the timeout logic as it currently stands in #621 to explicitly NOT rely on reconciling, because our first version relied on re-reconciling every 30s and catching a timeout that way. I'm not sure if we didn't think of timing a future re-reconcile or if the functionality didn't exist at the time?

Anyway this seems like the way to go!

@mattmoor
Member Author

mattmoor commented Sep 9, 2020

Do you have a link to the code where this happens?

Sorry, this fell off my radar. I'm not sure what code ptr you are looking for, but it's probably deep in the informer/cache/lister code in k8s.io/client-go.

our first version relied on re-reconciling every 30s and catching a timeout that way

Woof, this won't scale. Generally here I'd recommend designing an idempotent library that moves this work off the main Reconciler thread. The Reconciler thread can call into the idempotent non-blocking library to access the result. The library should take a callback to enqueue a key when it thinks the state has changed. We use this pattern fairly successfully for our network prober in Knative (also workqueue based): https://github.com/knative/networking/blob/1b3a7517e759e94d813eb5440194413e8e2c603f/pkg/status/status.go#L131

@bobcatfish
Collaborator

@mattmoor what do you think about @imjasonh's suggestion to call impl.EnqueueAfter? When a TaskRun or PipelineRun starts, we can do something like:

  1. Reconcile, start pods, etc.
  2. Call EnqueueAfter to trigger a re-reconcile after the timeout expires
  3. At this point either:
    a. the TaskRun or PipelineRun will complete successfully: when the EnqueueAfter expires, it will re-reconcile but see there is nothing to do
    b. the TaskRun or PipelineRun will still be running when EnqueueAfter expires: it will re-reconcile, see that the timeout has occurred, and stop it
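
A compact sketch of that flow in Go (the Run type, its fields, and the enqueueAfter hook are illustrative stand-ins, not Tekton's actual reconciler; in practice the hook would be backed by knative/pkg's impl.EnqueueAfter):

    package controller

    import (
        "context"
        "time"
    )

    // Run stands in for a TaskRun or PipelineRun.
    type Run struct {
        Key       string
        StartTime time.Time
        Timeout   time.Duration
        Done      bool
    }

    type Reconciler struct {
        // enqueueAfter schedules key for re-reconciliation after a delay.
        enqueueAfter func(key string, after time.Duration)
    }

    func (c *Reconciler) reconcile(ctx context.Context, run *Run) error {
        if run.Done {
            return nil // 3a: already complete; the re-reconcile is a no-op
        }
        if time.Since(run.StartTime) >= run.Timeout {
            // 3b: the timeout has elapsed; fail the run here (elided).
            return nil
        }
        // 1: reconcile as usual (start pods, etc.), then
        // 2: schedule a re-reconcile for when the timeout will expire.
        c.enqueueAfter(run.Key, run.Timeout-time.Since(run.StartTime))
        return nil
    }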

@mattmoor
Member Author

mattmoor commented Sep 9, 2020

tl;dr 👍

I'd probably still encapsulate that in a library to avoid having the Reconciler need to know about enqueue functions directly. It's generally better for testing and whatnot too.

You can probably rely on the workqueue itself to make it idempotent, but I think the key thing is that every reconcile (not just "created") should call into that library, to avoid the workqueue becoming a single point of failure for this state.

@bobcatfish
Collaborator

I think the key thing is that every reconcile (not just "created") should call into that library, to avoid the workqueue becoming a single point of failure for this state.

Hmm, it seems like something is still going to have to know whether items have been enqueued or not 🤔

@mattmoor
Member Author

mattmoor commented Sep 9, 2020

Not if it's idempotent; then it can blindly add it and rely on the workqueue to dedupe 😉

@bobcatfish
Collaborator

Not if it's idempotent; then it can blindly add it and rely on the workqueue to dedupe 😉

It would only be idempotent if we know the workqueue will dedupe, right?

There are 2 things I'll want to track down and verify before relying on this:

  1. That we can guarantee everything will be re-reconciled on startup
  2. That the workqueue will dedupe (I'm not sure it would?? How would it know you didn't mean to enqueue twice, esp. if there was a mistake and the After was off by milliseconds or something 🤔)

@mattmoor
Member Author

Syncing the world on startup (or when elected leader) is pretty fundamental, so if you don't see this it is a bug and we should drill in.

The workqueue will deduplicate any key that isn't in-flight, which is key to reacting properly to informer events mid-reconcile. I'm not sure how the workqueue handles different delays, but for simple enqueues it should be fine, and I'd sort of expect it to deduplicate with the smallest delay.
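
The dedup behavior for delayed adds is easy to check empirically with client-go's delaying workqueue; this small sketch assumes (as the implementation does at the time of writing) that a duplicate key keeps the earliest ready time:

    package main

    import (
        "fmt"
        "time"

        "k8s.io/client-go/util/workqueue"
    )

    func main() {
        q := workqueue.NewDelayingQueue()
        defer q.ShutDown()

        // Two AddAfters for the same key: the queue tracks the key once,
        // keeping the earliest ready time.
        q.AddAfter("arendelle/my-taskrun", 100*time.Millisecond)
        q.AddAfter("arendelle/my-taskrun", 500*time.Millisecond)

        item, _ := q.Get() // returns the single deduped item after ~100ms
        fmt.Println("got:", item)
        q.Done(item)

        // Even after the longer delay would have fired, nothing remains.
        time.Sleep(600 * time.Millisecond)
        fmt.Println("remaining:", q.Len()) // 0
    }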

@bobcatfish
Collaborator

Hmmm, so it sounds like we'd be in a situation something like this:

  1. task run created by user
  2. task run added to workqueue
  3. pipeline controller starts reconciling task run
  4. pipeline controller calls EnqueueAfter for timeout
  5. pipeline controller finishes reconcile, creates pod, etc....
    ... later....
  6. something changes, e.g. pod status updated
  7. task run added to workqueue (at this point, the existing enqueue caused by EnqueueAfter is deduped, and removed)
  8. pipeline controller starts reconciling task run
  9. pipeline controller calls EnqueueAfter for timeout
  10. pipeline controller finishes reconcile
    ... etc ...

i.e. every time the controller reconciles the TaskRun/PipelineRun, it needs to call impl.EnqueueAfter because any previous calls are moot

Is that right? That seems less reliable to me than our current method of starting a goroutine and relying on that: if we ever failed to call impl.EnqueueAfter correctly on a re-reconcile, we'd lose the timeouts completely.

@mattmoor
Member Author

Right, because imagine if 6. is <controller crashes and a new controller picks it up>. If during that Reconcile it identifies that it should be processed at time=X to handle timing things out, then it should blindly call EnqueueAfter.

If the controller is in a good steady state, then the idempotency will make it a nop. If it isn't, then it just saved your bacon.

@bobcatfish bobcatfish removed this from the Pipelines v0.17 milestone Sep 21, 2020
@bobcatfish bobcatfish assigned imjasonh and unassigned bobcatfish Sep 21, 2020
@bobcatfish
Collaborator

@imjasonh kindly volunteered to pick this up, with the goal of updating it with enough information that we could put a "good first issue" label on it.

@imjasonh
Member

I'll start with a summary of the discussion so far; please hop in with more questions/clarifications/links to source as you see fit.

The current thinking (correct me if I'm wrong!) is that we can use EnqueueAfter to schedule a reconciliation at a point in the future when the run should be complete, and have reconciliation logic simply 1) check whether the timeout has elapsed, and if so trigger the existing timeout behavior (update status, delete pod, etc.), and 2) exit early if a run IsDone, since there's nothing to reconcile in that case. (2) should always be done anyway, since reconciliations for done runs can already happen today (e.g., on startup).

Relying on EnqueueAfter does introduce a new reconciliation for every run, but since the vast majority of those will be no-ops to reconcile already-completed runs, this shouldn't be too painful.

This also makes us reliant on the specific behavior of the workqueue, which we'd like to understand better. Rather than spawning a goroutine in the reconciler/controller code, we'd be offloading that "remind me later" to the external workqueue. This means that if we're behind on processing the workqueue, timeouts might be enforced later than they otherwise should be, and later than they might be if we retain our goroutine-based timeout enforcement.

However, offloading to the workqueue should make us more reliable than we are today in cases when the controller crashes or gets a normal update, since the state of "I am waiting to check on a run's timeout" is encoded in the deferred workqueue item, and not in a spawned-but-waiting goroutine. It also means we can delete our own code and streamline this area of the reconciler.

@mattmoor @bobcatfish Is there any more context I'm missing? Next I'll outline the code changes we'd make to move to EnqueueAfter, and mark this as a [good first issue].

@mattmoor
Member Author

mattmoor commented Nov 5, 2020

Alrighty, so this is the root cause of at least some of the timeout_test flakes I have been chasing.

From a recent failure's logstream:

    kubelogs.go:197: I 03:38:39.163 controlplane-2 [timeout/handler.go:278] [arendelle-652kd/task-run-timeout-xovuhquv] timer for "arendelle-652kd/task-run-timeout-xovuhquv" has activated after 26.852349086s
    kubelogs.go:197: I 03:38:39.357 controlplane-2 [taskrun/taskrun.go:351] [arendelle-652kd/task-run-timeout-xovuhquv] Cloud Events: []
    kubelogs.go:197: I 03:38:39.564 controlplane-2 [taskrun/taskrun.go:441] [arendelle-652kd/task-run-timeout-xovuhquv] Successfully reconciled taskrun task-run-timeout-xovuhquv/arendelle-652kd with status: &apis.Condition{Type:"Succeeded", Status:"Unknown", Severity:"", LastTransitionTime:apis.VolatileTime{Inner:v1.Time{Time:time.Time{wall:0x0, ext:63740144300, loc:(*time.Location)(0x3722a00)}}}, Reason:"Running", Message:"Not all Steps in the Task have finished executing"}

Looking at the log line from the timeout handler a bit closer... This log statement is prior to calling the callback to enqueue a taskrun that we expect to fail with a timeout, and this test sets a 30s timeout, so the start time should be 03:38:09, but it is actually 03:38:15.

So the callback is executed ~6 seconds prematurely, and because the non-timeout reconcile doesn't do an idempotent requeue, handling of timeouts for this key is permanently dropped on the floor.

@mattmoor
Member Author

mattmoor commented Nov 5, 2020

I really think that the bulk of this logic could simply be handled by something along the lines of this defer (in taskrun.go):

	// Check if the TaskRun has timed out; if it is, this will set its status
	// accordingly.
	if tr.HasTimedOut() {
		message := fmt.Sprintf("TaskRun %q failed to finish within %q", tr.Name, tr.GetTimeout())
		err := c.failTaskRun(ctx, tr, v1beta1.TaskRunReasonTimedOut, message)
		return c.finishReconcileUpdateEmitEvents(ctx, tr, before, err)
	}
	defer func() {
		if tr.Status.StartTime == nil || tr.Spec.Timeout == nil {
			return
		}
		// Compute the time since the task started.
		elapsed := time.Since(tr.Status.StartTime.Time)
		// enqueue this taskrun's key after that long.
		enqueueAfter(tr, tr.Spec.Timeout.Duration-elapsed)
	}()

There is some logic around backoff that is probably better rationalized in terms of changes to how "StartTime" is set (since if there are pod creation errors, the run hasn't really started as far as timeout handling is concerned, IMO).

PipelineRun could probably get away with something similar, though I haven't cracked that code today to look.
