Update taskrun/pipelinerun timeout logic to not rely on resync behavior #604

Closed
wants to merge 1 commit

Conversation

@shashwathi (Contributor) commented Mar 11, 2019

Changes

In this PR, each new taskrun/pipelinerun starts a goroutine that waits for a stop signal, completion, or timeout, whichever occurs first. Once a run object times out, the handler adds it to the respective controller's work queue. When the run controllers are restarted, new goroutines are created to track the existing timeouts. Mutexes are added to safely access the run objects' status.
The same timeout handler is used for pipelinerun / taskrun, so the keys are prefixed with "TaskRun" or "PipelineRun" to differentiate them.
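
For illustration, here is a minimal sketch of the per-run timeout goroutine described above. The timeoutHandler type, its field names, and the enqueue callback are assumptions made for this sketch, not necessarily the names used in the PR:

package timeout

import (
    "sync"
    "time"
)

// timeoutHandler tracks one goroutine per running TaskRun/PipelineRun key.
// Illustrative only; the PR's actual types and names may differ.
type timeoutHandler struct {
    stopCh  <-chan struct{}      // controller shutdown signal
    doneMut sync.Mutex           // guards the done map
    done    map[string]chan bool // per-run "finished" channels
    enqueue func(key string)     // re-queues the run into its controller's workqueue
}

// waitFor blocks until the run finishes, the controller stops, or the timeout
// elapses; on timeout it re-enqueues the key so the reconciler can mark the
// run as timed out. It is meant to run in its own goroutine per run object.
func (t *timeoutHandler) waitFor(key string, timeout time.Duration) {
    finished := t.getOrCreateFinishedChan(key)
    select {
    case <-t.stopCh: // controller is shutting down
    case <-finished: // the run completed before the timeout
    case <-time.After(timeout): // timeout elapsed: trigger a reconcile
        t.enqueue(key)
    }
}

// getOrCreateFinishedChan returns the run's "finished" channel, creating it
// under the mutex so concurrent callers observe the same channel.
func (t *timeoutHandler) getOrCreateFinishedChan(key string) chan bool {
    t.doneMut.Lock()
    defer t.doneMut.Unlock()
    if finished, ok := t.done[key]; ok {
        return finished
    }
    finished := make(chan bool)
    t.done[key] = finished
    return finished
}

On controller restart, the same waitFor could be launched again for every run that is not yet done, with the remaining timeout computed from the run's Status.StartTime.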

why: As the number of taskruns and pipelineruns increases, the controllers cannot keep up with the number of reconciliations triggered. One of the solutions to tackle this problem is to increase the resync period from 30s to 10h. That change breaks taskrun/pipelinerun timeouts, because the current implementation relies on the resync behavior to update the run status to "Timeout".
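
For context, the resync period is the interval at which a shared informer re-queues every watched object even if nothing changed. Below is a generic client-go sketch of the 30s-to-10h change described above; this is not the Tekton controller code, and the in-cluster client construction is an assumption:

package main

import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // Previously: every object was re-queued roughly every 30s, which also
    // happened to catch timed-out runs on the next resync.
    // factory := informers.NewSharedInformerFactory(client, 30*time.Second)

    // With a 10h resync, timed-out runs are no longer picked up promptly,
    // so timeouts must be tracked explicitly (the goroutines in this PR).
    factory := informers.NewSharedInformerFactory(client, 10*time.Hour)
    _ = factory
}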

I drew inspiration from @tzununbekov's PR in knative/build. Credit to
@pivotal-nader-ziada and @dprotaso for suggesting level-based reconciliation.

Fixes: #456

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you review them:

See the contribution guide
for more details.

cc @bobcatfish @vdemeester @imjasonh

@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shashwathi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 11, 2019
@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Mar 11, 2019
@tekton-robot tekton-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 11, 2019
@shashwathi (Contributor, Author)

e2e test failure message on the TestDAGPipelineRun test:

  message: 'pod status "PodScheduled":"False"; message: "pod has unbound immediate
I0311 16:33:52.537]             PersistentVolumeClaims (repeated 3 times)"

Is it possible that there is some resource limitation on the GKE project?
@vdemeester @bobcatfish?

@dlorenc (Contributor) commented Mar 12, 2019

/test pull-tekton-pipeline-integration-tests

@tekton-robot (Collaborator)

@shashwathi: The following test failed, say /retest to rerun them all:

Test name: pull-tekton-pipeline-integration-tests
Commit: ddcd154
Rerun command: /test pull-tekton-pipeline-integration-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@vdemeester (Member) left a comment


Backported some comments from the previous PR 👼

Instead of my refactoring suggestion, we could do something like the opposite:

func WaitTaskRun(…) {
    key := getTaskrunKey(tr.Namespace, tr.Name)
    return waitFor(key, tr.Spec.Timeout, tr.Status.StartTime, t.stopTaskRunFunc(tr))
}

func WaitPipelineRun(…) {
    key := getPipelinerunKey(pr.Namespace, pr.Name)
    return waitFor(key, pr.Spec.Timeout, pr.Status.StartTime, t.stopPipelineRunFunc(pr))
}

Looking good otherwise 👼

The build failure is a bit weird though…

t.StatusUnlock(key)
timeout -= runtime

var finished chan bool
@vdemeester (Member) commented on this diff:
Maybe this can be extracted (as it's common for TaskRun and PipelineRun) to also hide the use of doneMut.

finished := getOrCreateFinishedChan(key)
// […]
func getOrCreateFinishedChan(key string) chan bool {
	var finished chan bool
	doneMut.Lock()
	if existingfinishedChan, ok := done[key]; ok {
		finished = existingfinishedChan
	} else {
		finished = make(chan bool)
	}
	done[key] = finished
	doneMut.Unlock()
	return finished
}

@@ -164,7 +168,11 @@ func (c *Reconciler) Reconcile(ctx context.Context, key string) error {
 	pr := original.DeepCopy()
 	pr.Status.InitializeConditions()
 
-	if isDone(&pr.Status) {
+	if pr.Status.IsDone() {
+		statusMapKey := fmt.Sprintf("%s/%s", pipelineRunControllerName, key)
@vdemeester (Member) commented on this diff:

We may want to use getPipelineRunKey (same for TaskRun) to make sure we use the same key always 👼
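
For illustration, a minimal sketch of what such shared key helpers could look like, assuming the "<controller name>/<namespace>/<name>" format used in the diffs above; the actual helper names and signatures in the PR may differ:

package reconciler

import "fmt"

// Hypothetical constants and helpers for illustration; the PR may name these differently.
const (
    taskRunControllerName     = "TaskRun"
    pipelineRunControllerName = "PipelineRun"
)

// getTaskRunKey and getPipelineRunKey build the same "<controller>/<namespace>/<name>"
// key at every call site, instead of repeating fmt.Sprintf inline.
func getTaskRunKey(namespace, name string) string {
    return fmt.Sprintf("%s/%s/%s", taskRunControllerName, namespace, name)
}

func getPipelineRunKey(namespace, name string) string {
    return fmt.Sprintf("%s/%s/%s", pipelineRunControllerName, namespace, name)
}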

@@ -161,7 +167,11 @@ func (c *Reconciler) Reconcile(ctx context.Context, key string) error {
 	tr := original.DeepCopy()
 	tr.Status.InitializeConditions()
 
-	if isDone(&tr.Status) {
+	if tr.Status.IsDone() {
+		statusMapKey := fmt.Sprintf("%s/%s", taskRunControllerName, key)
@vdemeester (Member) commented on this diff:

Same here 👼

@shashwathi closed this Mar 14, 2019
Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cla: yes - Trying to make the CLA bot happy with ppl from different companies work on one commit
size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle build/taskrun timeout outside of resync
5 participants