-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid modifications to the informer's copy of resources. #2736
Conversation
In each of the `{Pipeline,Task}Run` reconcilers the functions to update status and labels/annotations refetch the resource from the informer cache, check the field they want to update, and if an update is needed they set the field on the informer's copy and call the appropriate update method. In pseudo-code: ```go func update(fr *FooRun) { newFr := lister.Get(fr.Name) if reflect.DeepEqual(newFr.Field, fr.Field) { newFr.Field = fr.Field // This modified the informer's copy! return client.Update(newFr) } } ``` I have worked around this in two different ways: 1. For the status updates I added a line like `newFr = newFr.DeepCopy()` immediately above the mutation to avoid writing to the informer's copy. 2. For the label/annotation updates, I changed the `Update` call to a `Patch` that bypasses optimistic concurrency checks. This last bit is important because otherwise the update above will lead to the first reconciliation *always* failing due to `resourceVersion` skew caused by the status update. This also works around some fun interactions with the test code (see fixed issue). There are two other notable aspects to this change: 1. Test bugs! There were a good number of places that were assuming that the object stored in the informer was altered. I changed most of these to refetch through the client. 2. D-Fence! I added some logic to some of the common test setup code to `DeepCopy()` resources before feeding them to the fake clients to try and avoid assumptions about "same object" creeping back in. It is also worth calling out that this change will very likely destabilize the metric that I identified [here](tektoncd#2729) as racy, which is likely masked by the mutation of the informer copies. Fixes: tektoncd#2734
This PR cannot be merged: expecting exactly one kind/ label Available
|
2 similar comments
This PR cannot be merged: expecting exactly one kind/ label Available
|
This PR cannot be merged: expecting exactly one kind/ label Available
|
The following is the coverage report on the affected files.
|
/kind bug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
Thanks so much for discovering and reporting and fixing this! Is there anything we can do to avoid misusing informer-provided values in this way in the future? It will be hard to ensure all future changes use a defensive copy.
Either way, lgtm, and thanks again!
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ImJasonH The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This is great, thank you! |
/retest |
1 similar comment
/retest |
The following is the coverage report on the affected files.
|
@imjasonh basically what we do is run the bulk of our test coverage through our own Reconciler "TableTest", and one of the things this does is sort of what I changed the test code to do, but better: It copies the input universe before running the test and then checks that the informer copies weren't changed (showing the diff). Using this for the bulk of our coverage, even when we needed to write more focused testing we usually don't see things creep in. The problem generally isn't even the primary resource (because the boilerplate setup generally always copies things), but lister fetches for child resources that might need updating. So even with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks for fixing the test as well!
/lgtm
Once we start fetching specs from OCI containers, we'll need an equivalent of the informers cache for that too! |
/hold cancel |
/test pull-tekton-pipeline-build-tests |
In each of the
{Pipeline,Task}Run
reconcilers the functions to update status and labels/annotations refetch the resource from the informer cache, check the field they want to update, and if an update is needed they set the field on the informer's copy and call the appropriate update method.In pseudo-code:
Changes
I have worked around this in two different ways:
For the status updates I added a line like
newFr = newFr.DeepCopy()
immediately above the mutation to avoid writing to the informer's copy.For the label/annotation updates, I changed the
Update
call to aPatch
that bypasses optimistic concurrency checks. This last bit is important because otherwise the update above will lead to the first reconciliation always failing due toresourceVersion
skew caused by the status update. This also works around some fun interactions with the test code (see fixed issue).There are two other notable aspects to this change:
DeepCopy()
resources before feeding them to the fake clients to try and avoid assumptions about "same object" creeping back in.It is also worth calling out that this change will very likely destabilize the metric that I identified here as racy, which is likely masked by the mutation of the informer copies.
Fixes: #2734
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you
review them:
See the contribution guide for more details.
Reviewer Notes
If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.
Release Notes
N/A
/hold
I'm throwing a hold on this due to the lingering
t.Skip
/DO NOT SUBMIT
, which I am asking about in slack, but I wanted to fling this at CI to see what falls out.