Move backoff logic into own struct and increase test coverage of timeout logic #3031

bobcatfish · 2020-07-29T22:04:11Z

Changes

Separate backoff logic from timeout logic ⏰

The timeout handler was trying to do two different jobs:

Track timeouts for executing TaskRuns and PipelineRuns; when they
timeout, they should be re-reconciled, at which point they would
fail b/c they took to long
Track backoffs when TaskRun pod creation fails due to resource quota
issues. In this case we want to retry pod creation (by re-reconciling)
after a backoff in case the resource issue has been resolved.

The logic for these 2 jobs was combined because they both share at their
heart very similar logic: the logic for creating go routines that wait
until a period of time has passed.

Now these two jobs are done by 2 different structs, which both use the
new Timer struct and logic, so they can share this logic, without having
to mix the code on top.

Future things we can do to make this even better:

Add more test coverage for the new structs
The "Handler" struct (which may have too generic a name at this point)
probably doesn't need to have separate logic for pipelineruns +
taskruns since each of these controllers will instanitate it
separately
As discussed in The timeoutHandler is only instructed to wait when it creates pods #2905, knative/pkg's controller logic will attempt
to re-reconcile all the Runs on start up, so we probably don't need
to be explicitly doing this as well

Add more timeout test coverage and fix timeout "bug" 🐞

Add more tests to increase coverage of timeout package. This means
updating some code to be more unit testable, specifically injecting the
function used to determine the current time into Backoff.

This revealed a "bug" where the backoff logic wasn't using a default
value when the TaskRun timeout isn't set - however the defaulting logic
ensures that it is always set, so if anything, maybe we should actually
panic in this case.

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

Includes tests (if functionality changed/added)
[n/a] Includes docs (if user facing)
Commit messages follow commit message best practices
Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

If you are adding a new binary/image to the cmd dir, please update
the release Task to build and release this image.

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

It almost seems like this is fixing a bug, but because of how the defaulting logic works, I don't think it really is.

NONE

tekton-robot · 2020-07-29T22:04:16Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign bobcatfish
You can assign the PR to them by writing /assign @bobcatfish in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tekton-robot · 2020-07-29T22:07:48Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/taskrun/controller.go	96.6%	96.7%	0.1
pkg/reconciler/taskrun/taskrun.go	78.3%	78.4%	0.1
pkg/timeout/backoff.go	Do not exist	93.9%
pkg/timeout/handler.go	77.4%	84.0%	6.6
pkg/timeout/timer.go	Do not exist	100.0%

bobcatfish · 2020-07-30T19:27:23Z

oooo a data race!

==================
WARNING: DATA RACE
Read at 0x00c000c3c198 by goroutine 103:
  reflect.typedmemmove()
      /usr/local/go/src/runtime/mbarrier.go:177 +0x0
  reflect.packEface()
      /usr/local/go/src/reflect/value.go:119 +0x126
  reflect.valueInterface()
      /usr/local/go/src/reflect/value.go:1023 +0x1b9
  reflect.Value.Interface()
      /usr/local/go/src/reflect/value.go:993 +0x3c27
  fmt.(*pp).printValue()
      /usr/local/go/src/fmt/print.go:726 +0x3c28
  fmt.(*pp).printValue()
      /usr/local/go/src/fmt/print.go:880 +0x2709
  fmt.(*pp).printArg()
      /usr/local/go/src/fmt/print.go:716 +0x25a
  fmt.(*pp).doPrintf()
      /usr/local/go/src/fmt/print.go:1030 +0x311
  fmt.Sprintf()
      /usr/local/go/src/fmt/print.go:219 +0x73
  go.uber.org/zap.(*SugaredLogger).log()
      /home/prow/go/src/github.com/tektoncd/pipeline/vendor/go.uber.org/zap/sugar.go:230 +0x175
  go.uber.org/zap.(*SugaredLogger).Errorf()
      /home/prow/go/src/github.com/tektoncd/pipeline/vendor/go.uber.org/zap/sugar.go:148 +0x331
  github.com/tektoncd/pipeline/pkg/timeout.(*Backoff).SetTimer()
      /home/prow/go/src/github.com/tektoncd/pipeline/pkg/timeout/backoff.go:142 +0x29a

Previous write at 0x00c000c3c198 by goroutine 79:
  knative.dev/pkg/apis/duck/v1beta1.(*Status).SetConditions()
      /home/prow/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/apis/duck/v1beta1/status_types.go:84 +0x43
  github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1.(*TaskRunStatus).SetConditions()
      <autogenerated>:1 +0x61
  knative.dev/pkg/apis.conditionsImpl.SetCondition()
      /home/prow/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/apis/condition_set.go:208 +0x3ee
  knative.dev/pkg/apis.(*conditionsImpl).SetCondition()
      <autogenerated>:1 +0x149
  github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1.(*TaskRunStatus).SetCondition()
      /home/prow/go/src/github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1/taskrun_types.go:253 +0x43c
  github.com/tektoncd/pipeline/pkg/reconciler/taskrun.(*Reconciler).handlePodCreationError()
      /home/prow/go/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun.go:470 +0x144
  github.com/tektoncd/pipeline/pkg/reconciler/taskrun.TestHandlePodCreationError.func1()
      /home/prow/go/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun_test.go:2100 +0xd5
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:992 +0x1eb

Goroutine 103 (running) created at:
  github.com/tektoncd/pipeline/pkg/reconciler/taskrun.(*Reconciler).handlePodCreationError()
      /home/prow/go/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun.go:467 +0x52a
  github.com/tektoncd/pipeline/pkg/reconciler/taskrun.TestHandlePodCreationError.func1()
      /home/prow/go/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun_test.go:2100 +0xd5
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:992 +0x1eb

Goroutine 79 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1043 +0x660
  github.com/tektoncd/pipeline/pkg/reconciler/taskrun.TestHandlePodCreationError()
      /home/prow/go/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun_test.go:2099 +0x12bd
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:992 +0x1eb

bobcatfish · 2020-07-30T19:35:27Z

I think this race condition might be another instance of #1307 🤔

The timeout handler was trying to do two different jobs: - Track timeouts for executing TaskRuns and PipelineRuns; when they timeout, they should be re-reconciled, at which point they would fail b/c they took to long - Track backoffs when TaskRun pod creation fails due to resource quota issues. In this case we want to retry pod creation (by re-reconciling) after a backoff in case the resource issue has been resolved. The logic for these 2 jobs was combined because they both share at their heart very similar logic: the logic for creating go routines that wait until a period of time has passed. Now these two jobs are done by 2 different structs, which both use the new Timer struct and logic, so they can share this logic, without having to mix the code on top. Future things we can do to make this even better: - Add more test coverage for the new structs - The "Handler" struct (which may have too generic a name at this point) probably doesn't need to have separate logic for pipelineruns + taskruns since each of these controllers will instanitate it separately - As discussed in tektoncd#2905, knative/pkg's controller logic will attempt to re-reconcile all the Runs on start up, so we probably don't need to be explicitly doing this as well

Add more tests to increase coverage of timeout package. This means updating some code to be more unit testable, specifically injecting the function used to determine the current time into Backoff. This revealed a "bug" where the backoff logic wasn't using a default value when the TaskRun timeout isn't set - however the defaulting logic ensures that it is always set, so if anything, maybe we should actually panic in this case.

tekton-robot · 2020-07-30T19:40:03Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/taskrun/controller.go	96.6%	96.7%	0.1
pkg/reconciler/taskrun/taskrun.go	78.3%	78.4%	0.1
pkg/timeout/backoff.go	Do not exist	93.9%
pkg/timeout/handler.go	77.4%	84.0%	6.6
pkg/timeout/timer.go	Do not exist	100.0%

tekton-robot · 2020-07-30T19:47:10Z

@bobcatfish: The following tests failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
pull-tekton-pipeline-unit-tests	`7da38ae`	link	`/test tekton-pipeline-unit-tests`
pull-tekton-pipeline-build-tests	`7da38ae`	link	`/test pull-tekton-pipeline-build-tests`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

pritidesai · 2020-07-31T02:00:28Z

pkg/timeout/backoff.go

+// 1. Stop signal, 2. completion or 3. a given Duration to elapse.
+func (b *Backoff) SetTimer(runObj StatusKey, d time.Duration) {
+	if b.timeoutCallback == nil {
+		b.logger.Errorf("attempted to set a timer for %q but no callback has been assigned", runObj)


Sounds very odd (at least to me 😉), commenting out this logger from here fixes race condition. This function invocation within handlePodCreationError and the following SetCondition are somehow causing this.

WARNING: DATA RACE Read at 0x00c000c3c198 by goroutine 103: ... go.uber.org/zap.(*SugaredLogger).log() ... Previous write at 0x00c000c3c198 by goroutine 79: knative.dev/pkg/apis/duck/v1beta1.(*Status).SetConditions() ...

Ah, well spotted! It looks like swapping out the %q for a %p in this log statement also removes the error. Using %q results in a string-formatted struct being printed, which probably walks the entire object and accesses all the fields. My hunch is that the %q formatter is not threadsafe in that case? In hindsight this may be why we have the StatusKey interface - the GetRunKey() func of TaskRun returns a thread-safe descriptor.

nice catch, yup it does walk the entire object. The next log statement (using GetRunKey() on runObj) looks safe and just prints the address of taskRun object instead.

"TaskRun/0xc00015af00"

bobcatfish · 2020-07-31T19:19:05Z

While trying to figure this out I think I found a bigger problem, I'm going to close this PR for now b/c I'll need to rebase it anyway.

My hunch is that the %q formatter is not threadsafe in that case? In hindsight this may be why we have the StatusKey interface - the GetRunKey() func of TaskRun returns a thread-safe descriptor.

In fact I think this is a bug that I introduced - the address of the Run might be thread safe, but it's not going to be the same on every reconcile! So I think that actually every single Run is timing out because the done channel can never be signaled 😅

tekton-robot added the release-note-none Denotes a PR that doesnt merit a release note. label Jul 29, 2020

tekton-robot requested review from dlorenc and pritidesai July 29, 2020 22:04

tekton-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jul 29, 2020

bobcatfish added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Jul 29, 2020

bobcatfish added 2 commits July 30, 2020 15:37

bobcatfish force-pushed the moar_timeout branch from 5289060 to 7da38ae Compare July 30, 2020 19:37

pritidesai reviewed Jul 31, 2020

View reviewed changes

bobcatfish closed this Jul 31, 2020

bobcatfish mentioned this pull request Aug 7, 2020

Actually stop trying to time out finished Runs ⏰ #3078

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move backoff logic into own struct and increase test coverage of timeout logic #3031

Move backoff logic into own struct and increase test coverage of timeout logic #3031

bobcatfish commented Jul 29, 2020

tekton-robot commented Jul 29, 2020

tekton-robot commented Jul 29, 2020

bobcatfish commented Jul 30, 2020

bobcatfish commented Jul 30, 2020

tekton-robot commented Jul 30, 2020

tekton-robot commented Jul 30, 2020 •

edited

Loading

pritidesai Jul 31, 2020

pritidesai Jul 31, 2020

ghost Jul 31, 2020 •

edited by ghost

Loading

pritidesai Jul 31, 2020 •

edited

Loading

bobcatfish commented Jul 31, 2020

Move backoff logic into own struct and increase test coverage of timeout logic #3031

Move backoff logic into own struct and increase test coverage of timeout logic #3031

Conversation

bobcatfish commented Jul 29, 2020

Changes

Separate backoff logic from timeout logic ⏰

Add more timeout test coverage and fix timeout "bug" 🐞

Submitter Checklist

Reviewer Notes

Release Notes

tekton-robot commented Jul 29, 2020

tekton-robot commented Jul 29, 2020

bobcatfish commented Jul 30, 2020

bobcatfish commented Jul 30, 2020

tekton-robot commented Jul 30, 2020

tekton-robot commented Jul 30, 2020 • edited Loading

pritidesai Jul 31, 2020

Choose a reason for hiding this comment

pritidesai Jul 31, 2020

Choose a reason for hiding this comment

ghost Jul 31, 2020 • edited by ghost Loading

Choose a reason for hiding this comment

pritidesai Jul 31, 2020 • edited Loading

Choose a reason for hiding this comment

bobcatfish commented Jul 31, 2020

tekton-robot commented Jul 30, 2020 •

edited

Loading

ghost Jul 31, 2020 •

edited by ghost

Loading

pritidesai Jul 31, 2020 •

edited

Loading