Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid modifications to the informer's copy of resources. #2736

Merged
merged 2 commits into from
Jun 3, 2020

Conversation

mattmoor
Copy link
Member

@mattmoor mattmoor commented Jun 3, 2020

In each of the {Pipeline,Task}Run reconcilers the functions to update status and labels/annotations refetch the resource from the informer cache, check the field they want to update, and if an update is needed they set the field on the informer's copy and call the appropriate update method.

In pseudo-code:

func update(fr *FooRun) {
  newFr := lister.Get(fr.Name)

  if reflect.DeepEqual(newFr.Field, fr.Field) {
    newFr.Field = fr.Field   // This modified the informer's copy!
    return client.Update(newFr)
  }
}

Changes

I have worked around this in two different ways:

  1. For the status updates I added a line like newFr = newFr.DeepCopy() immediately above the mutation to avoid writing to the informer's copy.

  2. For the label/annotation updates, I changed the Update call to a Patch that bypasses optimistic concurrency checks. This last bit is important because otherwise the update above will lead to the first reconciliation always failing due to resourceVersion skew caused by the status update. This also works around some fun interactions with the test code (see fixed issue).

There are two other notable aspects to this change:

  1. Test bugs! There were a good number of places that were assuming that the object stored in the informer was altered. I changed most of these to refetch through the client.
  2. D-Fence! I added some logic to some of the common test setup code to DeepCopy() resources before feeding them to the fake clients to try and avoid assumptions about "same object" creeping back in.

It is also worth calling out that this change will very likely destabilize the metric that I identified here as racy, which is likely masked by the mutation of the informer copies.

Fixes: #2734

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

See the contribution guide for more details.

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

N/A

/hold

I'm throwing a hold on this due to the lingering t.Skip / DO NOT SUBMIT, which I am asking about in slack, but I wanted to fling this at CI to see what falls out.

In each of the `{Pipeline,Task}Run` reconcilers the functions to update status and labels/annotations refetch the resource from the informer cache, check the field they want to update, and if an update is needed they set the field on the informer's copy and call the appropriate update method.

In pseudo-code:

```go
func update(fr *FooRun) {
  newFr := lister.Get(fr.Name)

  if reflect.DeepEqual(newFr.Field, fr.Field) {
    newFr.Field = fr.Field   // This modified the informer's copy!
    return client.Update(newFr)
  }
}
```

I have worked around this in two different ways:

1. For the status updates I added a line like `newFr = newFr.DeepCopy()` immediately above the mutation to avoid writing to the informer's copy.

2. For the label/annotation updates, I changed the `Update` call to a `Patch` that bypasses optimistic concurrency checks.  This last bit is important because otherwise the update above will lead to the first reconciliation *always* failing due to `resourceVersion` skew caused by the status update.  This also works around some fun interactions with the test code (see fixed issue).

There are two other notable aspects to this change:

1. Test bugs! There were a good number of places that were assuming that the object stored in the informer was altered.  I changed most of these to refetch through the client.
2. D-Fence! I added some logic to some of the common test setup code to `DeepCopy()` resources before feeding them to the fake clients to try and avoid assumptions about "same object" creeping back in.

It is also worth calling out that this change will very likely destabilize the metric that I identified [here](tektoncd#2729) as racy, which is likely masked by the mutation of the informer copies.

Fixes: tektoncd#2734
@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 3, 2020
@tekton-robot tekton-robot requested review from imjasonh and a user June 3, 2020 05:08
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 3, 2020
@tekton-robot
Copy link
Collaborator

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneuous one.

2 similar comments
@tekton-robot
Copy link
Collaborator

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneuous one.

@tekton-robot
Copy link
Collaborator

This PR cannot be merged: expecting exactly one kind/ label

Available kind/ labels are:

kind/bug: Categorizes issue or PR as related to a bug.
kind/flake: Categorizes issue or PR as related to a flakey test
kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt.
kind/design: Categorizes issue or PR as related to design.
kind/documentation: Categorizes issue or PR as related to documentation.
kind/feature: Categorizes issue or PR as related to a new feature.
kind/misc: Categorizes issue or PR as a miscellaneuous one.

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 75.9% 75.9% -0.1
pkg/reconciler/taskrun/taskrun.go 76.1% 76.0% -0.1

@mattmoor
Copy link
Member Author

mattmoor commented Jun 3, 2020

/kind bug

@tekton-robot tekton-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 3, 2020
Copy link
Member

@imjasonh imjasonh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Thanks so much for discovering and reporting and fixing this! Is there anything we can do to avoid misusing informer-provided values in this way in the future? It will be hard to ensure all future changes use a defensive copy.

Either way, lgtm, and thanks again!

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 3, 2020
@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ImJasonH

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 3, 2020
@afrittoli
Copy link
Member

This is great, thank you!
I started looking into it, it's not clear to me yet why the TestReconcileWithPipelineResults - whether it's a problem of the test or if it's a problem of the reconciler code.
If the former I think it might be ok to merge this, and go fix the test afterwards.

@mattmoor
Copy link
Member Author

mattmoor commented Jun 3, 2020

/retest

1 similar comment
@mattmoor
Copy link
Member Author

mattmoor commented Jun 3, 2020

/retest

@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Jun 3, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 75.9% 80.5% 4.6
pkg/reconciler/taskrun/taskrun.go 76.1% 76.0% -0.1

@mattmoor
Copy link
Member Author

mattmoor commented Jun 3, 2020

Is there anything we can do to avoid misusing informer-provided values in this way in the future?

@imjasonh basically what we do is run the bulk of our test coverage through our own Reconciler "TableTest", and one of the things this does is sort of what I changed the test code to do, but better: It copies the input universe before running the test and then checks that the informer copies weren't changed (showing the diff). Using this for the bulk of our coverage, even when we needed to write more focused testing we usually don't see things creep in.

The problem generally isn't even the primary resource (because the boilerplate setup generally always copies things), but lister fetches for child resources that might need updating. So even with // +genreconciler taking over the boilerplate part, the child resource reconciliations can still get bitten by this.

Copy link
Member

@afrittoli afrittoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for fixing the test as well!
/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 3, 2020
@afrittoli
Copy link
Member

Once we start fetching specs from OCI containers, we'll need an equivalent of the informers cache for that too!

@afrittoli
Copy link
Member

/hold cancel

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 3, 2020
@mattmoor
Copy link
Member Author

mattmoor commented Jun 3, 2020

/test pull-tekton-pipeline-build-tests

@tekton-robot tekton-robot merged commit 49828a1 into tektoncd:master Jun 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tekton modifies objects in the informer cache
4 participants