Update taskrun/pipelinerun timeout logic to not rely on resync behavior #604
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: shashwathi. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment
e2e test failure msg on
Is it possible that there is some resource limitation on the GKE project?
what: In this PR each new taskrun/pipelinerun starts a goroutine that waits for a stop signal, completion, or timeout, whichever occurs first. Once a run times out, the handler adds the object to the respective controller's queue. When the run controllers are restarted, new goroutines are created to track the existing timeouts. Mutexes were added to safely update statuses. The same timeout handler is used for pipelineruns and taskruns, so the mutex keys are prefixed with "TaskRun" or "PipelineRun" to differentiate them. why: As the number of taskruns and pipelineruns increases, the controllers cannot handle the number of reconciliations triggered. One solution to this problem is to increase the resync period from 30s to 10h. That change, however, breaks taskrun/pipelinerun timeouts, because the previous implementation relied on the resync behavior to update the run status to "Timeout". I drew inspiration from @tzununbekov's PR in knative/build. Credit to @pivotal-nader-ziada and @dprotaso for suggesting level-based reconciliation.
/test pull-tekton-pipeline-integration-tests
@shashwathi: The following test failed, say /retest to rerun it:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Backported some comments from the previous PR 👼
Instead of my refactoring suggestion, we could do something like the opposite:
func WaitTaskRun(…) {
key := getTaskrunKey(tr.Namespace, tr.Name)
return waitFor(key, tr.Spec.Timeout, tr.Status.StartTime, t.stopTaskRunFunc(tr))
}
func WaitPipelineRun(…) {
key := getPipelinerunKey(pr.Namespace, pr.Name)
return waitFor(key, pr.Spec.Timeout, pr.Status.StartTime, t.stopPipelineRunFunc(pr))
}
Looking good otherwise 👼
The build failure is a bit weird though…
t.StatusUnlock(key)
timeout -= runtime

var finished chan bool
Maybe this can be extracted (as it's common for TaskRun and PipelineRun) to also hide the use of doneMut.
finished := getOrCreateFinishedChan(key)
// […]
func getOrCreateFinishedChan(key string) chan bool {
var finished chan bool
doneMut.Lock()
if existingfinishedChan, ok := done[key]; ok {
finished = existingfinishedChan
} else {
finished = make(chan bool)
}
done[key] = finished
doneMut.Unlock()
return finished
}
@@ -164,7 +168,11 @@ func (c *Reconciler) Reconcile(ctx context.Context, key string) error {
 	pr := original.DeepCopy()
 	pr.Status.InitializeConditions()

-	if isDone(&pr.Status) {
+	if pr.Status.IsDone() {
+		statusMapKey := fmt.Sprintf("%s/%s", pipelineRunControllerName, key)
We may want to use getPipelineRunKey (same for TaskRun) to make sure we always use the same key 👼
@@ -161,7 +167,11 @@ func (c *Reconciler) Reconcile(ctx context.Context, key string) error {
 	tr := original.DeepCopy()
 	tr.Status.InitializeConditions()

-	if isDone(&tr.Status) {
+	if tr.Status.IsDone() {
+		statusMapKey := fmt.Sprintf("%s/%s", taskRunControllerName, key)
Same here 👼
Changes
In this PR each new taskrun/pipelinerun starts a goroutine that waits for a stop signal, completion, or timeout, whichever occurs first. Once a run times out, the handler adds the object to the respective controller's queue. When the run controllers are restarted, new goroutines are created to track the existing timeouts. Mutexes were added to safely access the runtime objects' status.
The same timeout handler is used for pipelineruns and taskruns, so the mutex keys are prefixed with "TaskRun" or "PipelineRun" to differentiate them.
why: As the number of taskruns and pipelineruns increases, the controllers cannot handle the number of reconciliations triggered. One solution to this problem is to increase the resync period from 30s to 10h. That change, however, breaks taskrun/pipelinerun timeouts, because the previous implementation relied on the resync behavior to update the run status to "Timeout".
I drew inspiration from @tzununbekov's PR in knative/build. Credit to
@pivotal-nader-ziada and @dprotaso for suggesting level-based reconciliation.
Fixes: #456
Submitter Checklist
These are the criteria that every PR should meet; please check them off as you
review them:
See the contribution guide
for more details.
cc @bobcatfish @vdemeester @imjasonh