
Fix workloadentry lost when fast reconnect #29685

Merged
merged 5 commits into istio:master from hzxuzhonghu:fast-reconnect on Dec 28, 2020

Conversation

hzxuzhonghu
Member

This can be reproduced using the pilot-load tool.

@hzxuzhonghu hzxuzhonghu requested a review from a team as a code owner December 17, 2020 12:26
@google-cla google-cla bot added the cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. label Dec 17, 2020
@istio-testing istio-testing added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Dec 17, 2020
@hzxuzhonghu hzxuzhonghu added the release-notes-none Indicates a PR that does not require release notes. label Dec 17, 2020
@hzxuzhonghu
Member Author

There is also another issue I observed:

On reconnect, the normal order of events is disconnect -> connect, but there is a small chance that the connect happens before the disconnect. In that scenario, if the proxy reconnects to the same istiod, the workload entry will be lost.

I am thinking about checking whether the related ADS connection still exists in shouldCleanupEntry.
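
A rough sketch of that idea (not the actual istiod code; the adsConnections type and the key format are made up for illustration): before cleaning up an auto-registered WorkloadEntry, check whether an ADS connection for the same proxy is still active on this istiod instance, and if so skip the cleanup.

package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins for illustration only; not the actual istiod types.
type workloadEntry struct {
	Name, Namespace string
	Annotations     map[string]string
}

type adsConnections struct {
	mu    sync.Mutex
	conns map[string]struct{} // keyed by "<name>/<namespace>"
}

func (a *adsConnections) Contains(key string) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	_, ok := a.conns[key]
	return ok
}

// shouldCleanupEntry (sketch): skip cleanup if the proxy behind this
// WorkloadEntry has already reconnected to this istiod instance, i.e. the
// disconnect event being handled arrived after the new connect.
func shouldCleanupEntry(active *adsConnections, wle workloadEntry) bool {
	if active.Contains(wle.Name + "/" + wle.Namespace) {
		return false
	}
	// ... the existing annotation/age based checks would follow here ...
	return true
}

func main() {
	active := &adsConnections{conns: map[string]struct{}{"vm-1/default": {}}}
	fmt.Println(shouldCleanupEntry(active, workloadEntry{Name: "vm-1", Namespace: "default"}))
}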

@hzxuzhonghu hzxuzhonghu requested a review from a team as a code owner December 17, 2020 13:05
@istio-testing istio-testing added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 17, 2020
@hzxuzhonghu
Member Author

Based on an overnight test run, this works with multiple istiod instances as well.

@istio-testing istio-testing added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 18, 2020
@hzxuzhonghu
Member Author

/test integ-pilot-multicluster-tests_istio

@hzxuzhonghu hzxuzhonghu changed the title from "Fix workloadentry lost when fast reconnect to the same istiod" to "Fix workloadentry lost when fast reconnect" on Dec 22, 2020
// handle workload leak when both workload/pilot down at the same time before pilot has a chance to set disconnTime
connAt, err := time.Parse(timeFormat, connTime)
// if it has been 1.5*maxConnectionAge since workload connected, should delete it.
if err == nil && uint64(time.Since(connAt)) > uint64(c.maxConnectionAge)+uint64(c.maxConnectionAge/2) {
Member Author


This is used to handle a WorkloadEntry leak in some cases.

// 1. disconnect: the workload entry has been updated
// 2. connect: but the patch is based on the old workloadentry because of the propagation latency.
// So in this case the `DisconnectedAtAnnotation` is still there and the cleanup procedure will go on.
connTime := wle.Annotations[ConnectedAtAnnotation]
Contributor


This approach would probably work. I think we would also need the merge patch, or to add a retry on errors.IsInvalid. If we delete the annotation on line 290 after we Get the entry for the patch, the patch will not know whether to use op: add or op: replace.
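
To illustrate the concern (the annotation key below is only an example, not necessarily the one the controller uses): with a JSON patch the correct op depends on the object's current state on the server, while a merge patch only declares the desired value, so it is insensitive to that race.

package main

import "fmt"

func main() {
	// JSON patch: "replace" fails if the annotation is missing on the server
	// (e.g. it was removed between our Get and the Patch), while "add" needs
	// the annotations map to exist and silently overwrites a concurrent value.
	// Note: "/" inside a key must be escaped as "~1" in JSON patch paths.
	jsonPatchReplace := `[{"op":"replace","path":"/metadata/annotations/example.io~1disconnectedAt","value":"2020-12-28T00:00:00Z"}]`
	jsonPatchAdd := `[{"op":"add","path":"/metadata/annotations/example.io~1disconnectedAt","value":"2020-12-28T00:00:00Z"}]`

	// Merge patch: simply states the desired annotation value, so it does not
	// depend on whether the annotation currently exists.
	mergePatch := `{"metadata":{"annotations":{"example.io/disconnectedAt":"2020-12-28T00:00:00Z"}}}`

	fmt.Println(jsonPatchReplace)
	fmt.Println(jsonPatchAdd)
	fmt.Println(mergePatch)
}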

Member Author


Without merge patch, the JSON patch can return an error, and then the stream will be recreated. We depend on the client-side retry.

If merge patch is applied, it may mask potential races. I am not sure how much effort would be needed to handle that; this is the main concern I have.

Contributor


If we do retries, we probably want to leave as many annotations set as possible so that we can determine ordering. With the retry approach, we have a true "last-write-wins" on the entire object. Maybe this is easier.

Using merge means we avoid retries, which seems easier to reason about given this concurrent access. That approach is optimistic for connections but discards disconnections more easily. There are a few more things needed to make it work, included in my PR.

If we're going with the retry approach, I don't see a reason to use patch at all. If anything, patch would mask races, since it may or may not succeed under a concurrent modification, even without merge patch. It's probably best to change this to Update for the purposes of this PR.

Member Author


This PR is not about choosing between patch, merge patch, or update. The issue it tries to resolve cannot be fixed purely by switching between patch, merge patch, and update.

We can discuss that in a new issue.

Contributor

@stevenctl stevenctl Dec 28, 2020


If the rest of the approach relies on setting/unsetting annotations, you're likely to fail outright on the Patch when not using merge patch. I'm fine with leaving the existing patch here, but adding retries on errors.IsInvalid.

If we're relying on retries anyway, it seems easier to just use Update + k8s.io/client-go/util/retry on errors.IsConflict. That way, we retry on all conflicts instead of some subset, making things easier to reason about and debug.
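
A minimal sketch of that suggestion, assuming a typed istio client-go clientset and a placeholder annotation key; exact method signatures (the context arguments in particular) depend on the client-go version:

package wlecleanup

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"

	versioned "istio.io/client-go/pkg/clientset/versioned"
)

// setDisconnectedAt re-reads the WorkloadEntry and Updates it, retrying only on
// conflicts via retry.RetryOnConflict, so every concurrent writer eventually
// succeeds against a fresh copy of the object. The annotation key used here is
// a placeholder for illustration.
func setDisconnectedAt(ctx context.Context, client versioned.Interface, ns, name string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		wle, err := client.NetworkingV1alpha3().WorkloadEntries(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if wle.Annotations == nil {
			wle.Annotations = map[string]string{}
		}
		wle.Annotations["example.io/disconnectedAt"] = time.Now().Format(time.RFC3339Nano)
		_, err = client.NetworkingV1alpha3().WorkloadEntries(ns).Update(ctx, wle, metav1.UpdateOptions{})
		return err
	})
}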

Member Author


yeah, update sgtm.

@stevenctl
Contributor

/retest

@istio-testing istio-testing merged commit 374fcdc into istio:master Dec 28, 2020
@hzxuzhonghu hzxuzhonghu deleted the fast-reconnect branch December 29, 2020 01:41
stevenctl pushed a commit to stevenctl/istio that referenced this pull request Jan 5, 2021
* fix workloadentry lost when fast reconnect happen

* Handle fast reconnect: disconnect event later than reconnect event

* Cleanup when both workload/pilot down at the same time before pilot has a chance to set disconnTime

* fix ut

* address comment
istio-testing pushed a commit that referenced this pull request Jan 6, 2021
* Consistently lock and copy ads clients (#28968)

* Consistently lock and copy ads clients

fixes #28958

* fix wrong reference

* optimize

* fix race

* Simplify internal generator and refactor a wle controller (#29554)

* Refactor internal generator and workloadentry controller

* Simplify internal gen

* fix ci

* fix lint

* fix comment

* Handle wle register/unregister race (#29604)

* handle register/unregister wle race

* update

* update tmpl

* Fix ut

* fix lint

* Fix workloadentry lost when fast reconnect (#29685)

* fix workloadentry lost when fast reconnect happen

* Handle fast reconnect: disconnect event later than reconnect event

* Cleanup when both workload/pilot down at the same time before pilot has a chance to set disconnTime

* fix ut

* address comment

* Fix leaks in workload entry controller (#29793)

* vm auto registration: discard stale disconnect events (#29691)

* implement merge patch in crd controller

* make wle auto registration robust to register unregister race

* release note

Co-authored-by: John Howard <[email protected]>
Co-authored-by: Zhonghu Xu <[email protected]>