Update taskrun/pipelinerun timeout logic to not rely on resync behavior #604

Closed
wants to merge 1 commit

Conversation

@shashwathi (Contributor) commented Mar 11, 2019

Changes

In this PR, each new taskrun/pipelinerun starts a goroutine that waits for a stop signal, completion, or timeout, whichever occurs first. Once a run object times out, the handler adds it to the respective controller's work queue. When the run controllers are restarted, new goroutines are created to track the existing timeouts. Mutexes are added to safely access the run objects' status.
The same timeout handler is used for pipelinerun / taskrun, so the keys are prefixed with "TaskRun" or "PipelineRun" to differentiate them.
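
For illustration, here is a minimal sketch of the per-run timeout goroutine described above. The timeoutHandler type, its field names, and the enqueue callback are assumptions made for this sketch, not necessarily the names used in the PR:

package timeout

import (
    "sync"
    "time"
)

// timeoutHandler tracks one goroutine per running TaskRun/PipelineRun key.
// Illustrative only; the PR's actual types and names may differ.
type timeoutHandler struct {
    stopCh  <-chan struct{}      // controller shutdown signal
    doneMut sync.Mutex           // guards the done map
    done    map[string]chan bool // per-run "finished" channels
    enqueue func(key string)     // re-queues the run into its controller's workqueue
}

// waitFor blocks until the run finishes, the controller stops, or the timeout
// elapses; on timeout it re-enqueues the key so the reconciler can mark the
// run as timed out. It is meant to run in its own goroutine per run object.
func (t *timeoutHandler) waitFor(key string, timeout time.Duration) {
    finished := t.getOrCreateFinishedChan(key)
    select {
    case <-t.stopCh: // controller is shutting down
    case <-finished: // the run completed before the timeout
    case <-time.After(timeout): // timeout elapsed: trigger a reconcile
        t.enqueue(key)
    }
}

// getOrCreateFinishedChan returns the run's "finished" channel, creating it
// under the mutex so concurrent callers observe the same channel.
func (t *timeoutHandler) getOrCreateFinishedChan(key string) chan bool {
    t.doneMut.Lock()
    defer t.doneMut.Unlock()
    if finished, ok := t.done[key]; ok {
        return finished
    }
    finished := make(chan bool)
    t.done[key] = finished
    return finished
}

On controller restart, the same waitFor could be launched again for every run that is not yet done, with the remaining timeout computed from the run's Status.StartTime.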

why: As the number of taskruns and pipelineruns increases, the controllers cannot keep up with the number of reconciliations triggered. One of the solutions to tackle this problem is to increase the resync period from 30s to 10h. That change breaks taskrun/pipelinerun timeouts, because the current implementation relies on the resync behavior to update the run status to "Timeout".
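
For context, the resync period is the interval at which a shared informer re-queues every watched object even if nothing changed. Below is a generic client-go sketch of the 30s-to-10h change described above; this is not the Tekton controller code, and the in-cluster client construction is an assumption:

package main

import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // Previously: every object was re-queued roughly every 30s, which also
    // happened to catch timed-out runs on the next resync.
    // factory := informers.NewSharedInformerFactory(client, 30*time.Second)

    // With a 10h resync, timed-out runs are no longer picked up promptly,
    // so timeouts must be tracked explicitly (the goroutines in this PR).
    factory := informers.NewSharedInformerFactory(client, 10*time.Hour)
    _ = factory
}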

I drew inspiration from @tzununbekov's PR in knative/build. Credit to
@pivotal-nader-ziada and @dprotaso for suggesting level-based reconciliation.

Fixes: #456

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you review them:

See the contribution guide
for more details.

cc @bobcatfish @vdemeester @imjasonh

@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shashwathi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 11, 2019
@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Mar 11, 2019
@tekton-robot tekton-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 11, 2019
@shashwathi (Contributor, Author)

e2e test failure message on the TestDAGPipelineRun test:

  message: 'pod status "PodScheduled":"False"; message: "pod has unbound immediate
I0311 16:33:52.537]             PersistentVolumeClaims (repeated 3 times)"

Is it possible that there is some resource limitation on the GKE project?
@vdemeester @bobcatfish?

@dlorenc (Contributor) commented Mar 12, 2019

/test pull-tekton-pipeline-integration-tests

@tekton-robot (Collaborator)

@shashwathi: The following test failed, say /retest to rerun them all:

Test name: pull-tekton-pipeline-integration-tests
Commit: ddcd154
Rerun command: /test pull-tekton-pipeline-integration-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@vdemeester (Member) left a comment


Backported some comments from the previous PR 👼

Instead of my refactoring suggestion, we could do something like the opposite:

func WaitTaskRun(…) {
    key := getTaskrunKey(tr.Namespace, tr.Name)
    return waitFor(key, tr.Spec.Timeout, tr.Status.StartTime, t.stopTaskRunFunc(tr))
}

func WaitPipelineRun(…) {
    key := getPipelinerunKey(pr.Namespace, pr.Name)
    return waitFor(key, pr.Spec.Timeout, pr.Status.StartTime, t.stopPipelineRunFunc(pr))
}

Looking good otherwise 👼

The build failure is a bit weird though…

t.StatusUnlock(key)
timeout -= runtime

var finished chan bool
@vdemeester (Member) commented on this diff:
Maybe this can be extracted (as it's common for TaskRun and PipelineRun) to also hide the use of doneMut.

finished := getOrCreateFinishedChan(key)
// […]
func getOrCreateFinishedChan(key string) chan bool {
	var finished chan bool
	doneMut.Lock()
	if existingfinishedChan, ok := done[key]; ok {
		finished = existingfinishedChan
	} else {
		finished = make(chan bool)
	}
	done[key] = finished
	doneMut.Unlock()
	return finished
}

@@ -164,7 +168,11 @@ func (c *Reconciler) Reconcile(ctx context.Context, key string) error {
 	pr := original.DeepCopy()
 	pr.Status.InitializeConditions()
 
-	if isDone(&pr.Status) {
+	if pr.Status.IsDone() {
+		statusMapKey := fmt.Sprintf("%s/%s", pipelineRunControllerName, key)
@vdemeester (Member) commented on this diff:

We may want to use getPipelineRunKey (same for TaskRun) to make sure we use the same key always 👼
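
For illustration, a minimal sketch of what such shared key helpers could look like, assuming the "<controller name>/<namespace>/<name>" format used in the diffs above; the actual helper names and signatures in the PR may differ:

package reconciler

import "fmt"

// Hypothetical constants and helpers for illustration; the PR may name these differently.
const (
    taskRunControllerName     = "TaskRun"
    pipelineRunControllerName = "PipelineRun"
)

// getTaskRunKey and getPipelineRunKey build the same "<controller>/<namespace>/<name>"
// key at every call site, instead of repeating fmt.Sprintf inline.
func getTaskRunKey(namespace, name string) string {
    return fmt.Sprintf("%s/%s/%s", taskRunControllerName, namespace, name)
}

func getPipelineRunKey(namespace, name string) string {
    return fmt.Sprintf("%s/%s/%s", pipelineRunControllerName, namespace, name)
}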

@@ -161,7 +167,11 @@ func (c *Reconciler) Reconcile(ctx context.Context, key string) error {
 	tr := original.DeepCopy()
 	tr.Status.InitializeConditions()
 
-	if isDone(&tr.Status) {
+	if tr.Status.IsDone() {
+		statusMapKey := fmt.Sprintf("%s/%s", taskRunControllerName, key)
@vdemeester (Member) commented on this diff:

Same here 👼

@shashwathi closed this Mar 14, 2019
Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cla: yes - Trying to make the CLA bot happy with ppl from different companies work on one commit
size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle build/taskrun timeout outside of resync
5 participants