Refactor the way timeouts are handled #3500
Conversation
Staging this to let Prow test the first chunk as I look at doing the same elsewhere in the codebase. I'll split each round into separate commits in case folks want to take a look and give feedback incrementally.
cc @yaoxiaoqi
@@ -1748,8 +1748,8 @@ func TestReconcileInvalidTaskRuns(t *testing.T) {

	// Check actions and events
	actions := clients.Kube.Actions()
	if len(actions) != 3 || actions[0].Matches("namespaces", "list") {
Note that all of these checks were previously incorrect because they were missing the `!` and had their arguments transposed. 🙃
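For illustration only, here is a rough sketch of what a corrected form of that check could look like (this PR may well restructure or drop these assertions entirely). It assumes the `Matches(verb, resource)` argument order used by the fake-client actions in `k8s.io/client-go/testing`, and the expected action count of 3 is just carried over from the snippet above:

```go
// Hypothetical corrected check inside TestReconcileInvalidTaskRuns:
// note the added "!" and the (verb, resource) argument order.
actions := clients.Kube.Actions()
if len(actions) != 3 || !actions[0].Matches("list", "namespaces") {
	t.Errorf("unexpected actions: %v", actions)
}
```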
Alright, I have two more commits staged that:
I kinda want to see the integration test results before pushing that, but heads up that it's coming (again as separate commits).
OK, it passed. I'm going to push the other two commits now and remove the WIP.
Apparently I didn't delete enough code 🤣
Alright, e2e tests have passed twice (once with the pipelinerun changes). Hopefully this time everything is green, but I'd appreciate any feedback so we can hopefully squash the timeout_test flake (or start chasing what's left).
Thanks for working on this @mattmoor ! Looking forward to taking a look - tomorrow our team is having an "offsite", so there'll be a delay from me personally, at least.
Are you seeing these flakes locally, in recent PRs, or somewhere else?
Thanks for this change; I think it's going to be a useful simplification. I just have some concerns about resource-exhaustion handling.
AIUI, for my own understanding, the old timeout handling code ended up calling Enqueue anyway, so we were always reliant on the workqueue to schedule the timeout check, and if the workqueue backed up we'd start to fall behind on timeout checks, so this shouldn't introduce any new behavior except that we rely on EnqueueAfter's internal delay mechanism instead of our own. Is that all correct?
AIUI, for my own understanding, the old timeout handling code ended up calling Enqueue anyway,
Yes
so we were always reliant on the workqueue to schedule the timeout check, and if the workqueue backed up we'd start to fall behind on timeout checks,
The problem I saw was that we could actually end up calling the callback and processing the key before the resource had actually timed out [1], and if we miss that one shot the entire timeout handling breaks, because the logic that kicked off timeout handling was very edge-triggered.
[1] I suspect this is due to jitter in the .status.StartTime caused by the stale-informer issues we saw previously in #3460 (with PipelineRun, where we clamped the StartTime to the child TaskRun's time), so there is likely more we can do here (cc @pritidesai, who called this out), but this is certainly worth doing anyway as it's much more resilient than what's there now.
so this shouldn't introduce any new behavior except that we rely on EnqueueAfter's internal delay mechanism instead of our own. Is that all correct?
Essentially yes. In theory we could have kept the old method here, but each invocation consumes a NEW goroutine (and is not idempotent), which I suspect would tax the system under load. AFAIK the workqueue doesn't use goroutines for EnqueueAfter, so I suspect this is more efficient even than what's there today, and idempotent to boot, so we can blindly EnqueueAfter and let it deduplicate internally.
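To make that concrete, here is a minimal, self-contained sketch of the pattern being described (not the actual Tekton code; the key, timeout value, and function names are illustrative). The reconcile function is level-triggered and idempotent: it checks whether the deadline has passed and, if not, blindly re-enqueues the key for the remaining duration, relying on the delaying workqueue to coalesce duplicates instead of spawning a goroutine per timeout:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// A delaying workqueue fires items after a delay without a dedicated
	// goroutine per item, and duplicate keys are coalesced.
	queue := workqueue.NewDelayingQueue()
	defer queue.ShutDown()

	startTime := time.Now()
	timeout := 2 * time.Second
	key := "default/my-taskrun"

	// reconcile returns true once the resource has timed out.
	reconcile := func(key string) bool {
		remaining := timeout - time.Since(startTime)
		if remaining <= 0 {
			fmt.Println(key, "timed out; mark it failed")
			return true
		}
		fmt.Println(key, "not timed out yet; re-checking in", remaining)
		queue.AddAfter(key, remaining) // safe to call repeatedly
		return false
	}

	queue.Add(key)
	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		done := reconcile(item.(string))
		queue.Done(item)
		if done {
			return
		}
	}
}
```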
@@ -797,12 +804,12 @@ func combineTaskRunAndTaskSpecAnnotations(pr *v1beta1.PipelineRun, pipelineTask
 	return annotations
 }

-func getPipelineRunTimeout(ctx context.Context, pr *v1beta1.PipelineRun) metav1.Duration {
+func getPipelineRunTimeout(ctx context.Context, pr *v1beta1.PipelineRun) time.Duration {
We need one of these for TaskRun timeouts too; the same config is used for both (which seems sort of odd to me 🤔), so maybe we can just write one `GetTimeout(context.Context, *metav1.Duration) time.Duration` and share it in both places.
I noticed there is a `GetTimeout()` above these lines in the TaskRun reconciler, so I'm going to use that to handle defaulting, but it just uses the static default, not the configurable default. This should probably be fixed as well?
Alright fixed in a separate commit.
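For reference, a rough sketch of what such a shared helper could look like (not the exact code merged here): given the spec's optional `*metav1.Duration`, it returns the effective `time.Duration`, falling back to a configurable default rather than a hard-coded one. The `configuredDefaultMinutes` lookup below is a placeholder for however the controller's config store exposes `default-timeout-minutes`:

```go
package timeouts

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// configuredDefaultMinutes is a stand-in for reading the configurable
// default (e.g. default-timeout-minutes) from the controller's config store.
func configuredDefaultMinutes(ctx context.Context) int {
	return 60
}

// GetTimeout returns the timeout to enforce: the value from the spec if one
// was set, otherwise the configurable default.
func GetTimeout(ctx context.Context, specified *metav1.Duration) time.Duration {
	if specified != nil {
		return specified.Duration
	}
	return time.Duration(configuredDefaultMinutes(ctx)) * time.Minute
}
```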
`{Task,Pipeline}Run` now handle timeouts via `EnqueueAfter` on the workqueue. `pkg/timeout` is now removed. We now have consistent `GetTimeout(ctx)` methods on types.
/meow
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: vdemeester
/lgtm nice!
This permission was previously needed to support how we enforced timeouts, by listing all TaskRuns/PipelineRuns across all namespaces and determining whether they were past their timeout. Since #3500 this check was changed to not require listing all namespaces, so I believe the permission is no longer necessary.
/kind cleanup
Fixes: #2905