The timeoutHandler is only instructed to wait when it creates pods #2905
I was very excited to fix this bug, but I'm actually no longer sure it's a bug: when the controllers start, they call CheckTimeouts (pipelinerun, taskrun - they probably don't BOTH need to call it, but oh well XD), which looks for taskruns and pipelineruns and either starts the waiting go routine or times them out as needed. Looks like this was originally addressed in #456. That being said, having WaitTaskRun be idempotent makes a lot of sense, and it's definitely not obvious that this is happening, so I think there's some improvement to make here, but I don't think it's a bug.
Thinking about this a bit more, even making the call idempotent won't handle the issue being described:
The call to WaitTaskRun wouldn't happen unless the reconcile loop occurs: will a newly started controller attempt to reconcile all instances of the types it is interested in, or will it only be invoked when a change occurs - do you know, @mattmoor? If it's only when the instance changes, we'll still need to query for them all and check them on startup. If the reconciler tries to reconcile all types it is interested in on startup, then the solution you describe would work.
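For readers less familiar with this code, here is a rough sketch of the kind of startup sweep being discussed (list every run, time out the expired ones, and wait on the rest). The function and callback names are made up for illustration; this is not the actual Tekton implementation:

```go
package timeouts

import (
	"time"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// checkTimeoutsOnStartup is a hypothetical sketch: given every known TaskRun,
// either treat it as already timed out or arrange to wait for the remaining time.
// waitTaskRun stands in for the existing goroutine-based wait.
func checkTimeoutsOnStartup(taskRuns []*v1beta1.TaskRun, waitTaskRun func(*v1beta1.TaskRun, time.Duration)) {
	for _, tr := range taskRuns {
		// Skip runs that haven't started or have no timeout (completed runs
		// would also be skipped in the real controller).
		if tr.Status.StartTime == nil || tr.Spec.Timeout == nil {
			continue
		}
		remaining := tr.Spec.Timeout.Duration - time.Since(tr.Status.StartTime.Time)
		if remaining <= 0 {
			// Already past its timeout while no controller was watching; in the
			// real controller this run would be re-reconciled and failed as timed out.
			continue
		}
		// Otherwise arrange a reminder for when the timeout elapses.
		waitTaskRun(tr, remaining)
	}
}
```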
While investigating tektoncd#2905, I struggled to understand how the timeout handling works, especially with TimeoutSet having very few comments, so I've added some. I didn't add anything for backoffs yet because I'm hoping we can separate that into a separate structure, since it has a very specific purpose that doesn't generalize to all timeouts. Also changed the name "finished" to consistently use "done", so the reader doesn't have to wonder about the difference between "finished" and "done" (there isn't one).
Yes, we also do this when a controller becomes the leader of a particular bucket of the key space.
oh great!! @mattmoor are you able to point me toward the code where that happens?
When informers are populating their cache they call |
Do you have a link to the code where this happens? (Or some docs?) I can go looking for it if not - basically, if we're going to remove our attempt to manually reconcile all the timeouts, which seems totally unnecessary now, I want to be able to point at the justification for why we don't need it anymore.
The timeout handler was trying to do two different jobs:
- Track timeouts for executing TaskRuns and PipelineRuns; when they time out, they should be re-reconciled, at which point they would fail because they took too long.
- Track backoffs when TaskRun pod creation fails due to resource quota issues. In this case we want to retry pod creation (by re-reconciling) after a backoff, in case the resource issue has been resolved.

The logic for these two jobs was combined because they share, at their heart, very similar logic: the logic for creating go routines that wait until a period of time has passed. Now these two jobs are done by two different structs, which both use the new Timer struct and logic, so they can share this logic without having to mix the code on top.

Future things we can do to make this even better:
- Add more test coverage for the new structs.
- The "Handler" struct (which may have too generic a name at this point) probably doesn't need to have separate logic for pipelineruns + taskruns, since each of these controllers will instantiate it separately.
- As discussed in tektoncd#2905, knative/pkg's controller logic will attempt to re-reconcile all the Runs on startup, so we probably don't need to be explicitly doing this as well.
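A minimal sketch of the shape described above, with made-up names (the actual structs in the PR will differ):

```go
package timeouts

import "time"

// timer is the shared piece both trackers use: run a callback once a duration
// has passed, unless the done channel fires first (e.g. the run completed).
type timer struct {
	done chan struct{}
}

func (t *timer) setTimer(d time.Duration, callback func()) {
	select {
	case <-time.After(d):
		callback()
	case <-t.done:
		// Cancelled before firing: the run finished or the handler was stopped.
	}
}

// timeoutTracker would use the timer to re-reconcile runs that exceed their timeout.
type timeoutTracker struct{ t timer }

// backoffTracker would use the same timer to retry pod creation after resource
// quota failures, with its own growing backoff durations.
type backoffTracker struct{ t timer }
```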
I think @imjasonh's suggestion is ultimately the best way to go here: #3078 (review)
I wish I could remember why we didn't do it originally. I'm guessing that either we didn't realize we could (most likely?) or we were avoiding an extra reconcile (doesn't seem worth the complexity). We added the timeout logic as it currently stands in #621 to explicitly NOT rely on reconciling - because our first version relied on re-reconciling every 30s and catching a timeout that way. I'm not sure if we didn't think of timing a future re-reconcile or if the functionality didn't exist at the time? Anyway, this seems like the way to go!
Sorry, this fell off my radar. I'm not sure what code ptr you are looking for, but it's probably deep in the informer/cache/lister code in
Woof, this won't scale. Generally here I'd recommend designing an idempotent library that moves this work off the main Reconciler thread. The Reconciler thread can call into the idempotent, non-blocking library to access the result. The library should take a callback to enqueue a key when it thinks the state has changed. We use this pattern fairly successfully for our network prober in Knative (also workqueue based): https://github.com/knative/networking/blob/1b3a7517e759e94d813eb5440194413e8e2c603f/pkg/status/status.go#L131
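The linked prober code is the real reference; below is only a rough sketch of the same pattern (idempotent, non-blocking, with an enqueue callback), using hypothetical names:

```go
package timeouts

import (
	"sync"
	"time"
)

// timeoutTracker sketches the idempotent, non-blocking library described above:
// reconcilers call Track on every reconcile, and the tracker calls enqueue(key)
// once the remaining time has elapsed, prompting a re-reconcile.
type timeoutTracker struct {
	mu      sync.Mutex
	tracked map[string]struct{}
	enqueue func(key string) // callback provided by the controller to requeue a key
}

func newTimeoutTracker(enqueue func(string)) *timeoutTracker {
	return &timeoutTracker{tracked: map[string]struct{}{}, enqueue: enqueue}
}

// Track is idempotent: repeated calls for the same key start only one wait.
func (t *timeoutTracker) Track(key string, remaining time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.tracked[key]; ok {
		return // already waiting on this key
	}
	t.tracked[key] = struct{}{}
	time.AfterFunc(remaining, func() {
		t.mu.Lock()
		delete(t.tracked, key)
		t.mu.Unlock()
		t.enqueue(key) // the re-reconcile will notice the run has timed out
	})
}
```

The reconciler would call Track on every reconcile with the remaining time, so redundant calls are cheap no-ops and the reconciler never needs to know about enqueue functions directly.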
@mattmoor what do you think about @imjasonh's suggestion to call the enqueue function directly from the Reconciler?
tl;dr 👍 I'd probably still encapsulate that in a library to avoid having the Reconciler need to know about enqueue functions directly. It's generally better for testing and whatnot too. You can probably rely on the workqueue here.
hmm it seems like something is still going to have to know if items have been enqueued or not 🤔
Not if it's idempotent - then it can blindly add it and rely on the workqueue to dedup 😉
It would only be idempotent if we know the workqueue will dedupe, right? There are 2 things I'll want to track down and verify before relying on this:
Syncing the world on startup (or when elected leader) is pretty fundamental, so if you don't see this it is a bug and we should drill in. The workqueue will deduplicate any key that isn't in flight, which is key to reacting properly to informer events mid-reconcile. I'm not sure how the workqueue handles different delays, but for simple enqueues it should be fine, and I'd sort of expect it to deduplicate with the smallest delay.
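For reference, here is a tiny sketch of the client-go workqueue behavior being described. This reflects my understanding of the dedup semantics and is worth verifying against the workqueue package docs:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.NewDelayingQueue()
	defer q.ShutDown()

	// Adding the same key twice while it isn't in flight yields a single item.
	q.Add("default/my-taskrun")
	q.Add("default/my-taskrun")
	fmt.Println(q.Len()) // 1

	// AddAfter re-adds the key once the delay elapses; if the key is still
	// queued (and not in flight) at that point, it is deduplicated as well.
	q.AddAfter("default/my-taskrun", 30*time.Second)
}
```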
hmmm so it sounds like we'd be in a situation something like this:
i.e. every time the controller reconciles the TaskRun/PipelineRun, it needs to call the enqueue function - is that right? That seems less reliable to me than our current method, i.e. starting a goroutine and relying on that - if we ever failed to call it, the timeout would never be enforced.
Right, because imagine if we ever did fail to call it. If the controller is in a good steady state, then the idempotency will make it a nop. If it isn't, then it just saved your bacon.
@imjasonh kindly volunteered to pick this up, with the goal to update it with enough information that we could put a "good first issue" label on it.
I'll start with a summary of the discussion so far; please, anybody, hop in with more questions/clarifications/links to source as you see fit.

The current thinking (correct me if I'm wrong!) is that we can rely on the workqueue's delayed enqueue ("re-add this key after N seconds") instead of spawning our own timeout goroutines. Relying on that also makes us reliant on the specific behavior of the workqueue, which we'd like to understand better. Rather than spawning a goroutine in the reconciler/controller code, we'd be offloading that "remind me later" to the external workqueue. This means that if we're behind on processing the workqueue, timeouts might be enforced later than they otherwise should be, and later than they might be if we retain our goroutine-based timeout enforcement.

However, offloading to the workqueue should make us more reliable than we are today in cases when the controller crashes or gets a normal update, since the state of "I am waiting to check on a run's timeout" is encoded in the deferred workqueue item and not in a spawned-but-waiting goroutine. It also means we can delete our own code and streamline this area of the reconciler.

@mattmoor @bobcatfish Is there any more context I'm missing? Next I'll outline the code changes we'd make to move to this approach.
Alrighty, so this is the root cause of at least some of the timeout_test flakes I have been chasing. From a recent failure's logstream:
Looking at the log line from the timeout handler a bit closer: this log statement is prior to calling the callback to enqueue a taskrun that we expect to fail with a timeout, and this test sets a 30s timeout, so the callback shouldn't fire until 30s after the recorded start time. Instead, the callback is executed ~6 seconds prematurely, and because the non-timeout reconcile doesn't do an idempotent requeue, handling of timeouts for this key is permanently dropped on the floor.
I really think that the bulk of this logic could simply be handled by something along the lines of this:

```go
// Check if the TaskRun has timed out; if it has, this will set its status
// accordingly.
if tr.HasTimedOut() {
	message := fmt.Sprintf("TaskRun %q failed to finish within %q", tr.Name, tr.GetTimeout())
	err := c.failTaskRun(ctx, tr, v1beta1.TaskRunReasonTimedOut, message)
	return c.finishReconcileUpdateEmitEvents(ctx, tr, before, err)
}

defer func() {
	if tr.Status.StartTime == nil || tr.Spec.Timeout == nil {
		return
	}
	// Compute the time since the task started.
	elapsed := time.Since(tr.Status.StartTime.Time)
	// Enqueue this taskrun's key again once the remaining timeout has elapsed.
	enqueueAfter(tr, tr.Spec.Timeout.Duration-elapsed)
}()
```

There is some logic around backoff that is probably better rationalized in terms of changes to how "StartTime" is set (since if there are pod creation errors, it hasn't really started as far as timeout handling is concerned, IMO). PipelineRun could probably get away with something similar, though I haven't cracked that code today to look.
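For the premature-firing discussion above, the timeout check itself boils down to comparing elapsed time against the spec'd timeout; an illustrative version (not the actual Tekton method) might look like:

```go
package timeouts

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hasTimedOut is an illustrative version of the check the snippet above relies
// on: a run with a recorded start time and a non-zero timeout has timed out once
// the elapsed time reaches that timeout.
func hasTimedOut(startTime *metav1.Time, timeout *metav1.Duration, now time.Time) bool {
	if startTime == nil || startTime.IsZero() || timeout == nil || timeout.Duration == 0 {
		return false
	}
	return now.Sub(startTime.Time) >= timeout.Duration
}
```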
/kind bug
Consider this logic in taskrun.go: if, after creating the Pod, the controller process is restarted (e.g. to roll out a new controller binary), then the timeout handler won't properly time out task runs straddling this event, because the controller's resync won't trigger the go routine.
The pattern we have used successfully in Knative is to have the library support idempotent invocation, so that WaitTaskRun can be called on every resync (as appropriate) and simply have it elide redundant invocations. This is something that would shake out pretty quickly with chaos testing (knative/pkg#1333 👀).
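As a sketch of what "idempotent invocation" could look like for WaitTaskRun, assuming a hypothetical handler that tracks in-flight waits (the names and fields here are made up, not the actual timeout handler):

```go
package timeouts

import (
	"sync"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// TimeoutHandler is a hypothetical stand-in for the real timeout handler:
// inFlight records which TaskRuns already have a wait goroutine running, and
// wait is the existing blocking wait logic.
type TimeoutHandler struct {
	inFlight sync.Map // key: "namespace/name"
	wait     func(*v1beta1.TaskRun)
}

// WaitTaskRun can safely be called on every resync: LoadOrStore makes redundant
// calls no-ops, so only the first call per TaskRun spawns a waiting goroutine.
func (h *TimeoutHandler) WaitTaskRun(tr *v1beta1.TaskRun) {
	key := tr.Namespace + "/" + tr.Name
	if _, alreadyWaiting := h.inFlight.LoadOrStore(key, struct{}{}); alreadyWaiting {
		return
	}
	go func() {
		defer h.inFlight.Delete(key)
		h.wait(tr) // blocks until the timeout elapses or the run completes
	}()
}
```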