
Fix workloadentry lost when fast reconnect #29685

Merged
merged 5 commits into istio:master from hzxuzhonghu:fast-reconnect on Dec 28, 2020

Conversation

hzxuzhonghu
Member

This can be reproduced using the pilot-load tool.

@hzxuzhonghu hzxuzhonghu requested a review from a team as a code owner December 17, 2020 12:26
@google-cla google-cla bot added the cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. label Dec 17, 2020
@istio-testing istio-testing added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Dec 17, 2020
@hzxuzhonghu hzxuzhonghu added the release-notes-none Indicates a PR that does not require release notes. label Dec 17, 2020
@hzxuzhonghu
Member Author

There is also another issue I observed:

On reconnect, the normal order of events is disconnect -> connect, but there is a small chance that the connect happens before the disconnect. In that scenario, if the proxy reconnects to the same istiod, the workload entry will be lost.

I am thinking about checking whether the related ADS connection still exists in shouldCleanupEntry.
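
A rough sketch of that idea (not the actual istiod code; the adsConnections type and the key format are made up for illustration): before cleaning up an auto-registered WorkloadEntry, check whether an ADS connection for the same proxy is still active on this istiod instance, and if so skip the cleanup.

package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins for illustration only; not the actual istiod types.
type workloadEntry struct {
	Name, Namespace string
	Annotations     map[string]string
}

type adsConnections struct {
	mu    sync.Mutex
	conns map[string]struct{} // keyed by "<name>/<namespace>"
}

func (a *adsConnections) Contains(key string) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	_, ok := a.conns[key]
	return ok
}

// shouldCleanupEntry (sketch): skip cleanup if the proxy behind this
// WorkloadEntry has already reconnected to this istiod instance, i.e. the
// disconnect event being handled arrived after the new connect.
func shouldCleanupEntry(active *adsConnections, wle workloadEntry) bool {
	if active.Contains(wle.Name + "/" + wle.Namespace) {
		return false
	}
	// ... the existing annotation/age based checks would follow here ...
	return true
}

func main() {
	active := &adsConnections{conns: map[string]struct{}{"vm-1/default": {}}}
	fmt.Println(shouldCleanupEntry(active, workloadEntry{Name: "vm-1", Namespace: "default"}))
}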

@hzxuzhonghu hzxuzhonghu requested a review from a team as a code owner December 17, 2020 13:05
@istio-testing istio-testing added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 17, 2020
@hzxuzhonghu
Member Author

Based on an overnight test run, this works with multiple istiod instances as well.

@istio-testing istio-testing added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 18, 2020
@hzxuzhonghu
Member Author

/test integ-pilot-multicluster-tests_istio

@hzxuzhonghu hzxuzhonghu changed the title from "Fix workloadentry lost when fast reconnect to the same istiod" to "Fix workloadentry lost when fast reconnect" on Dec 22, 2020
// handle workload leak when both workload/pilot down at the same time before pilot has a chance to set disconnTime
connAt, err := time.Parse(timeFormat, connTime)
// if it has been 1.5*maxConnectionAge since workload connected, should delete it.
if err == nil && uint64(time.Since(connAt)) > uint64(c.maxConnectionAge)+uint64(c.maxConnectionAge/2) {
Member Author


This is used to handle a WorkloadEntry leak in some cases.

// 1. disconnect: the workload entry has been updated
// 2. connect: but the patch is based on the old workloadentry because of the propagation latency.
// So in this case the `DisconnectedAtAnnotation` is still there and the cleanup procedure will go on.
connTime := wle.Annotations[ConnectedAtAnnotation]
Contributor


This approach would probably work. I think we would also need the merge patch, or to add a retry on errors.IsInvalid. If we delete the annotation on line 290 after we Get the entry for the patch, the patch will not know whether to use op: add or op: replace.
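
To illustrate the concern (the annotation key below is only an example, not necessarily the one the controller uses): with a JSON patch the correct op depends on the object's current state on the server, while a merge patch only declares the desired value, so it is insensitive to that race.

package main

import "fmt"

func main() {
	// JSON patch: "replace" fails if the annotation is missing on the server
	// (e.g. it was removed between our Get and the Patch), while "add" needs
	// the annotations map to exist and silently overwrites a concurrent value.
	// Note: "/" inside a key must be escaped as "~1" in JSON patch paths.
	jsonPatchReplace := `[{"op":"replace","path":"/metadata/annotations/example.io~1disconnectedAt","value":"2020-12-28T00:00:00Z"}]`
	jsonPatchAdd := `[{"op":"add","path":"/metadata/annotations/example.io~1disconnectedAt","value":"2020-12-28T00:00:00Z"}]`

	// Merge patch: simply states the desired annotation value, so it does not
	// depend on whether the annotation currently exists.
	mergePatch := `{"metadata":{"annotations":{"example.io/disconnectedAt":"2020-12-28T00:00:00Z"}}}`

	fmt.Println(jsonPatchReplace)
	fmt.Println(jsonPatchAdd)
	fmt.Println(mergePatch)
}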

Member Author


Without merge patch, the JSON patch can return an error, and then the stream will be recreated. We depend on the client-side retry.

If merge patch is applied, it may mask potential races. I am not sure how much effort would be needed to handle that; this is the main concern I have.

Contributor


If we do retries, we probably want to leave as many annotations set as possible so that we can determine ordering. With the retry approach, we have a true "last-write-wins" on the entire object. Maybe this is easier.

Using merge means we avoid retries, which seems easier to reason about given this concurrent access. That approach is optimistic for connections but discards disconnections more easily. There are a few more things needed to make it work, included in my PR.

If we're going with the retry approach, I don't see a reason to use patch at all. If anything, patch would mask races, since it may or may not succeed under a concurrent modification, even without merge patch. It's probably best to change this to Update for the purposes of this PR.

Member Author


This PR is not about choosing between patch, merge patch, or update. The issue it tries to resolve cannot be fixed purely by switching between patch, merge patch, and update.

We can discuss that in a new issue.

Contributor

@stevenctl stevenctl Dec 28, 2020


If the rest of the approach relies on setting/unsetting annotations, you're likely to fail outright on the Patch when not using merge patch. I'm fine with leaving the existing patch here, but adding retries on errors.IsInvalid.

If we're relying on retries anyway, it seems easier to just use Update + k8s.io/client-go/util/retry on errors.IsConflict. That way, we retry on all conflicts instead of some subset, making things easier to reason about and debug.
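
A minimal sketch of that suggestion, assuming a typed istio client-go clientset and a placeholder annotation key; exact method signatures (the context arguments in particular) depend on the client-go version:

package wlecleanup

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"

	versioned "istio.io/client-go/pkg/clientset/versioned"
)

// setDisconnectedAt re-reads the WorkloadEntry and Updates it, retrying only on
// conflicts via retry.RetryOnConflict, so every concurrent writer eventually
// succeeds against a fresh copy of the object. The annotation key used here is
// a placeholder for illustration.
func setDisconnectedAt(ctx context.Context, client versioned.Interface, ns, name string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		wle, err := client.NetworkingV1alpha3().WorkloadEntries(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if wle.Annotations == nil {
			wle.Annotations = map[string]string{}
		}
		wle.Annotations["example.io/disconnectedAt"] = time.Now().Format(time.RFC3339Nano)
		_, err = client.NetworkingV1alpha3().WorkloadEntries(ns).Update(ctx, wle, metav1.UpdateOptions{})
		return err
	})
}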

Member Author


yeah, update sgtm.

@stevenctl
Contributor

/retest

@istio-testing istio-testing merged commit 374fcdc into istio:master Dec 28, 2020
@hzxuzhonghu hzxuzhonghu deleted the fast-reconnect branch December 29, 2020 01:41
stevenctl pushed a commit to stevenctl/istio that referenced this pull request Jan 5, 2021
* fix workloadentry lost when fast reconnect happen

* Handle fast reconnect: disconnect event later than reconnect event

* Cleanup when both workload/pilot down at the same time before pilot has a chance to set disconnTime

* fix ut

* address comment
istio-testing pushed a commit that referenced this pull request Jan 6, 2021
* Consistently lock and copy ads clients (#28968)

* Consistently lock and copy ads clients

fixes #28958

* fix wrong reference

* optimize

* fix race

* Simplify internal generator and refactor a wle controller (#29554)

* Refactor internal generator and workloadentry controller

* Simplify internal gen

* fix ci

* fix lint

* fix comment

* Handle wle register/unregister race (#29604)

* handle register/unregister wle race

* update

* update tmpl

* Fix ut

* fix lint

* Fix workloadentry lost when fast reconnect (#29685)

* fix workloadentry lost when fast reconnect happen

* Handle fast reconnect: disconnect event later than reconnect event

* Cleanup when both workload/pilot down at the same time before pilot has a chance to set disconnTime

* fix ut

* address comment

* Fix leaks in workload entry controller (#29793)

* vm auto registration: discard stale disconnect events (#29691)

* implement merge patch in crd controller

* make wle auto registration robust to register unregister race

* release note

Co-authored-by: John Howard <[email protected]>
Co-authored-by: Zhonghu Xu <[email protected]>