TestTaskRunTimeout is flakey : fix timeouts #731
Keeping track of progress fixing it in tektoncd#731. Signed-off-by: Vincent Demeester <[email protected]>
@vdemeester just let me know if you want a break from the investigation for a while, I'd be happy to take it on for a while 💃
/assign @bobcatfish
The more the merrier (especially as I'm running out of time this week)
tl;dr: The timeout handler and the reconciler disagree about when a Run has timed out. Okay, I have a theory about what's going on. In #771 I ran the timeout test 10 times (and p.s. this seems to reproduce the issue locally as well), and in https://tekton-releases.appspot.com/build/tekton-prow/pr-logs/pull/tektoncd_pipeline/771/pull-tekton-pipeline-integration-tests/1118981945099816962/ we have 2/10 of those timing out. I also added a lot of (terrible) logs so I could see what was going on, and I noticed some interesting things:
For example, for one of the failing runs, the timeout handler's logs show it deciding the run has timed out. This would then trigger the reconcile loop, but as far as the reconcile loop is concerned, the timeout hasn't been reached yet, so it lets the run keep going. And the same thing happens for the other failing runs.
So ∴ the theory is that the timeout handler and the reconcile loop are slightly out of sync in their logic for determining when a timeout has occurred. Anyway, I don't know what the fix is haha - at a glance they SEEM to be doing the same thing. But the key imo is in making sure that exactly the same logic is being applied in both places, OR in moving the logic to handle the timeout case into the timeout handler itself (instead of triggering an entire reconcile). Whew! (maybe 🤞🤞 🤞 )
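To make the theory concrete, here is a hypothetical sketch (the names and numbers are illustrative, not taken from the Tekton codebase) of how two checks that each look correct can disagree when they measure the same timeout from slightly different start times:

```go
package main

import (
	"fmt"
	"time"
)

const timeout = 30 * time.Second

// handlerSaysTimedOut mimics the timeout-handler goroutine, which measures
// from the StartTime recorded when conditions were initialized.
func handlerSaysTimedOut(initStart, now time.Time) bool {
	return now.Sub(initStart) >= timeout
}

// reconcilerSaysTimedOut mimics the reconcile loop, which by this point
// measures from the pod's slightly later start time.
func reconcilerSaysTimedOut(podStart, now time.Time) bool {
	return now.Sub(podStart) >= timeout
}

func main() {
	now := time.Now()
	initStart := now.Add(-timeout)                   // handler's clock: deadline exactly reached
	podStart := initStart.Add(50 * time.Millisecond) // pod started 50ms after initialization

	// The handler fires a reconcile, but the reconciler still sees ~50ms
	// left, does nothing, and nothing ever triggers another check.
	fmt.Println("handler thinks timed out:   ", handlerSaysTimedOut(initStart, now)) // true
	fmt.Println("reconciler thinks timed out:", reconcilerSaysTimedOut(podStart, now)) // false
}
```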
Can see similar stuff happening in https://storage.googleapis.com/tekton-prow/pr-logs/pull/tektoncd_pipeline/771/pull-tekton-pipeline-integration-tests/1118995157169999873/build-log.txt as well (with "improved" logs).
I foolishly tried to use the example at the top of the test README when I was trying to run the timeout tests (for the tektoncd#731 investigation) and I was sad when it failed b/c the default timeout was too short. This is the timeout we are using when running CI anyway!
I originally tried to completely remove the PipelineRun timeout handler - but it turns out that actually both controllers _do_ check for timeouts, they just do it differently. I hope in a follow-up PR to make them more similar, but in the meantime, I refactored the timeout handler logic such that the handler itself is the same for both kinds of runs, and added the same `HasStarted` function for TaskRuns that PipelineRuns have. Also:
- Added more logs (which was how I figured out the problem with tektoncd#731 in the first place!)
- Followed some linting guidelines, fixing typos, adding keyed fields
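For reference, a `HasStarted` helper along these lines can be as small as the following. This is a minimal sketch assuming the status records `StartTime` as a `*metav1.Time`, not necessarily the exact Tekton implementation:

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Trimmed-down stand-ins for the real types.
type TaskRunStatus struct {
	StartTime *metav1.Time
}

type TaskRun struct {
	Status TaskRunStatus
}

// HasStarted reports whether the reconciler has begun acting on this
// TaskRun, i.e. whether a start time has ever been recorded.
func (tr *TaskRun) HasStarted() bool {
	return tr.Status.StartTime != nil && !tr.Status.StartTime.IsZero()
}
```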
Our TaskRun timeout end to end test has been intermittently failing against PRs. After adding a lot of (terrible) log messages, I found out that the reason TaskRuns weren't timing out was b/c the go routine checking for a timeout was considering them timed out several milliseconds before the reconcile loop would. So what would happen was:
1. The timeout handler decided the run timed out, and triggered a reconcile
2. The reconcile checked if the run timed out, decided the run had a few more milliseconds of execution time and let it go
3. And the TaskRun would just keep running

It looks like the root cause is that when the go routine starts, it uses a `StartTime` that has been set by `TaskRun.InitializeConditions`, but after that, the pod is started and the `TaskRun.Status` is updated to use _the pod's_ `StartTime` instead - which is what the Reconcile loop will use from that point forward, causing the slight drift. This is fixed by no longer tying the start time of the TaskRun to the pod: the TaskRun will be considered started as soon as the reconciler starts to act on it, which is probably the functionality the user would expect anyway (e.g. if the pod was delayed in being started, this delay should be subtracted from the timeout, and in tektoncd#734 we are looking to be more tolerant of pods not immediately scheduling anyway).

Fixes tektoncd#731

Co-authored-by: Nader Ziada <[email protected]>
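A minimal sketch of the shape of that fix, with illustrative names (`recordStartTime` is hypothetical, not the actual Tekton function): record the start time once, when the reconciler first acts on the run, and never overwrite it from the pod:

```go
package reconciler

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Trimmed-down stand-ins for the real types.
type TaskRunStatus struct {
	StartTime *metav1.Time
}

type TaskRun struct {
	Status TaskRunStatus
}

// recordStartTime pins the run's start time to the first moment the
// reconciler acts on it, so the timeout handler and the reconcile loop
// measure the timeout from the same instant.
func recordStartTime(tr *TaskRun, pod *corev1.Pod) {
	if tr.Status.StartTime == nil {
		now := metav1.Now()
		tr.Status.StartTime = &now
	}
	// The buggy behaviour this replaces (sketched, not the literal code):
	//   tr.Status.StartTime = pod.Status.StartTime
	// which drifts a few milliseconds from the handler's start time.
	_ = pod
}
```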
Logging in the timeout handler was added as part of tektoncd#731 cuz it helped us debug when the timeout handler didn't work as expected. Unfortunately, it looks like the logger we're using can't be used in multiple go routines (uber-go/zap#99 may be related). Removing this logging to fix tektoncd#1124; hopefully we can find a safe way to add logging back in tektoncd#1307.
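The thread doesn't pin down the exact failure, but one way this kind of logging breaks is when a long-lived timeout goroutine holds a logger whose lifetime is tied to something shorter-lived, such as a test. A hypothetical sketch using `zaptest`:

```go
package timeout_test

import (
	"testing"
	"time"

	"go.uber.org/zap/zaptest"
)

// The goroutine outlives the test body; zaptest's logger writes through
// testing.T, which is not safe to use once the test has completed.
func TestLoggerOutlivedByGoroutine(t *testing.T) {
	logger := zaptest.NewLogger(t).Sugar()
	go func() {
		time.Sleep(100 * time.Millisecond) // still running after the test returns
		logger.Info("taskrun timed out")   // unsafe: t is already done
	}()
	// The test returns here without waiting for the goroutine.
}
```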
Minor fix: escape 'envsubst' using 'DOLLAR'
Expected Behavior
TestTaskRunTimeout should pass all the time 💯
Actual Behavior
TestTaskRunTimeout is failing a lot. Here are some examples of failures:
Steps to Reproduce the Problem
/retest and 🙏

I'm working on it in #691 but it's at a point where it makes a lot of builds fail, and I'm really thinking of marking the test as skipped and tracking its re-enablement here.
cc @bobcatfish
/kind bug
/assign