Refactor the way timeouts are handled #3500
Conversation
Staging this to let Prow test the first chunk as I look at doing the same elsewhere in the codebase. I'll split each round into separate commits in case folks want to take a look and give feedback incrementally.
cc @yaoxiaoqi
@@ -1748,8 +1748,8 @@ func TestReconcileInvalidTaskRuns(t *testing.T) {

	// Check actions and events
	actions := clients.Kube.Actions()
	if len(actions) != 3 || actions[0].Matches("namespaces", "list") {
Note that all of these checks were previously incorrect because they were missing the `!` and had their arguments transposed. 🙃
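For illustration only, here is a rough sketch of what a corrected form of that check could look like (this PR may well restructure or drop these assertions entirely). It assumes the `Matches(verb, resource)` argument order used by the fake-client actions in `k8s.io/client-go/testing`, and the expected action count of 3 is just carried over from the snippet above:

```go
// Hypothetical corrected check inside TestReconcileInvalidTaskRuns:
// note the added "!" and the (verb, resource) argument order.
actions := clients.Kube.Actions()
if len(actions) != 3 || !actions[0].Matches("list", "namespaces") {
	t.Errorf("unexpected actions: %v", actions)
}
```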
Alright, I have two more commits staged that:
I kinda want to see the integration test results before pushing that, but heads up that it's coming (again as separate commits).
OK, it passed. I'm going to push the other two commits now and remove the WIP.
Apparently I didn't delete enough code 🤣
Alright, e2e tests have passed twice (once with the pipelinerun changes). Hopefully this time everything is green, but I'd appreciate any feedback so we can hopefully squash the timeout_test flake (or start chasing what's left).
Thanks for working on this @mattmoor ! Looking forward to taking a look - tomorrow our team is having an "offsite", so there'll be a delay from me personally, at least.
Are you seeing these flakes locally, in recent PRs, or somewhere else?
Thanks for this change; I think it's going to be a useful simplification. I just have some concerns about resource-exhaustion handling.
AIUI, for my own understanding, the old timeout handling code ended up calling Enqueue anyway, so we were always reliant on the workqueue to schedule the timeout check, and if the workqueue backed up we'd start to fall behind on timeout checks, so this shouldn't introduce any new behavior except that we rely on EnqueueAfter's internal delay mechanism instead of our own. Is that all correct?
AIUI, for my own understanding, the old timeout handling code ended up calling Enqueue anyway,
Yes
so we were always reliant on the workqueue to schedule the timeout check, and if the workqueue backed up we'd start to fall behind on timeout checks,
The problem I saw was that we could actually end up calling the callback and processing the key before the resource had actually timed out [1], and if we miss that one shot the entire timeout handling breaks, because the logic that kicked off timeout handling was very edge-triggered.
[1] I suspect this is due to jitter in the .status.StartTime caused by the stale-informer issues we saw previously in #3460 (with PipelineRun, where we clamped the StartTime to the child TaskRun's time), so there is likely more we can do here (cc @pritidesai, who called this out), but this is certainly worth doing anyway as it's much more resilient than what's there now.
so this shouldn't introduce any new behavior except that we rely on EnqueueAfter's internal delay mechanism instead of our own. Is that all correct?
Essentially yes. In theory we could have kept the old method here, but each invocation consumes a NEW goroutine (and is not idempotent), which I suspect would tax the system under load. AFAIK the workqueue doesn't use goroutines for EnqueueAfter, so I suspect this is more efficient even than what's there today, and idempotent to boot, so we can blindly EnqueueAfter and let it deduplicate internally.
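To make that concrete, here is a minimal, self-contained sketch of the pattern being described (not the actual Tekton code; the key, timeout value, and function names are illustrative). The reconcile function is level-triggered and idempotent: it checks whether the deadline has passed and, if not, blindly re-enqueues the key for the remaining duration, relying on the delaying workqueue to coalesce duplicates instead of spawning a goroutine per timeout:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// A delaying workqueue fires items after a delay without a dedicated
	// goroutine per item, and duplicate keys are coalesced.
	queue := workqueue.NewDelayingQueue()
	defer queue.ShutDown()

	startTime := time.Now()
	timeout := 2 * time.Second
	key := "default/my-taskrun"

	// reconcile returns true once the resource has timed out.
	reconcile := func(key string) bool {
		remaining := timeout - time.Since(startTime)
		if remaining <= 0 {
			fmt.Println(key, "timed out; mark it failed")
			return true
		}
		fmt.Println(key, "not timed out yet; re-checking in", remaining)
		queue.AddAfter(key, remaining) // safe to call repeatedly
		return false
	}

	queue.Add(key)
	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		done := reconcile(item.(string))
		queue.Done(item)
		if done {
			return
		}
	}
}
```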
@@ -797,12 +804,12 @@ func combineTaskRunAndTaskSpecAnnotations(pr *v1beta1.PipelineRun, pipelineTask
 	return annotations
 }

-func getPipelineRunTimeout(ctx context.Context, pr *v1beta1.PipelineRun) metav1.Duration {
+func getPipelineRunTimeout(ctx context.Context, pr *v1beta1.PipelineRun) time.Duration {
We need one of these for TaskRun timeouts too; the same config is used for both (which seems sort of odd to me 🤔), so maybe we can just write one `GetTimeout(context.Context, *metav1.Duration) time.Duration` and share it in both places.
I noticed there is a `GetTimeout()` above these lines in the TaskRun reconciler, so I'm going to use that to handle defaulting, but it just uses the static default, not the configurable default. This should probably be fixed as well?
Alright fixed in a separate commit.
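For reference, a rough sketch of what such a shared helper could look like (not the exact code merged here): given the spec's optional `*metav1.Duration`, it returns the effective `time.Duration`, falling back to a configurable default rather than a hard-coded one. The `configuredDefaultMinutes` lookup below is a placeholder for however the controller's config store exposes `default-timeout-minutes`:

```go
package timeouts

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// configuredDefaultMinutes is a stand-in for reading the configurable
// default (e.g. default-timeout-minutes) from the controller's config store.
func configuredDefaultMinutes(ctx context.Context) int {
	return 60
}

// GetTimeout returns the timeout to enforce: the value from the spec if one
// was set, otherwise the configurable default.
func GetTimeout(ctx context.Context, specified *metav1.Duration) time.Duration {
	if specified != nil {
		return specified.Duration
	}
	return time.Duration(configuredDefaultMinutes(ctx)) * time.Minute
}
```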
`{Task,Pipeline}Run` now handle timeouts via `EnqueueAfter` on the workqueue. `pkg/timeout` is now removed. We now have consistent `GetTimeout(ctx)` methods on types.
/meow
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: vdemeester
/lgtm nice!
This permission was previously needed to support how we enforced timeouts, by listing all TaskRuns/PipelineRuns across all namespaces and determining whether they were past their timeout. Since #3500 this check was changed to not require listing all namespaces, so I believe the permission is no longer necessary.
/kind cleanup
Fixes: #2905