Fix workloadentry lost when fast reconnect #29685
Conversation
There is also another issue I observed: on reconnect, the order is normally disconnect -> connect, but there is a small chance that the connect happens before the disconnect. In that scenario, if the proxy reconnects to the same istiod, the workload entry will be lost. I am thinking about checking whether the related ADS connection still exists in …
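A rough sketch of that idea; the controller, the active-connection map, and the connection ID fields below are hypothetical placeholders, not code from this PR:

package reconnectsketch

import "sync"

// Hypothetical sketch: before treating a disconnect as authoritative, check
// whether a newer ADS connection for the same proxy is already registered on
// this istiod. If so, the proxy reconnected before the disconnect was
// processed, and the WorkloadEntry must not be marked disconnected.
type connection struct {
	proxyID string
	connID  string
}

type registrationController struct {
	mu     sync.Mutex
	active map[string]string // proxyID -> ID of the currently live ADS connection
}

func (c *registrationController) onDisconnect(conn connection, markDisconnected func(connection)) {
	c.mu.Lock()
	liveID, ok := c.active[conn.proxyID]
	c.mu.Unlock()
	if ok && liveID != conn.connID {
		// A newer connection already replaced this one (connect arrived
		// before disconnect); skip setting DisconnectedAtAnnotation so the
		// cleanup queue does not delete a live WorkloadEntry.
		return
	}
	markDisconnected(conn)
}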
Based on an overnight test, it works with multiple istiod instances as well.
Branch updated from bfad27c to 6e3c6fb.
/test integ-pilot-multicluster-tests_istio
// handle workload leak when both workload/pilot down at the same time before pilot has a chance to set disconnTime
connAt, err := time.Parse(timeFormat, connTime)
// if it has been 1.5*maxConnectionAge since workload connected, should delete it.
if err == nil && uint64(time.Since(connAt)) > uint64(c.maxConnectionAge)+uint64(c.maxConnectionAge/2) {
This is used to handle the wle leak in some cases.
// 1. disconnect: the workload entry has been updated
// 2. connect: but the patch is based on the old workloadentry because of the propagation latency.
// So in this case the `DisconnectedAtAnnotation` is still there and the cleanup procedure will go on.
connTime := wle.Annotations[ConnectedAtAnnotation]
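For readability, the two diff fragments quoted above combine into roughly the following check. This is only a sketch: the controller type, the annotation value, and timeFormat are assumptions standing in for the real definitions in the PR.

package cleanupsketch

import "time"

// Assumed stand-ins for the real constants referenced in the diff.
const ConnectedAtAnnotation = "istio.io/connectedAt"

const timeFormat = time.RFC3339Nano

type cleanupController struct {
	maxConnectionAge time.Duration
}

// shouldForceCleanup reports whether an entry with no disconnect time should
// still be removed: if both the workload and pilot went down before pilot
// could set disconnTime, the entry is treated as stale once more than
// 1.5*maxConnectionAge has passed since the recorded connect time.
func (c *cleanupController) shouldForceCleanup(annotations map[string]string) bool {
	connTime := annotations[ConnectedAtAnnotation]
	if connTime == "" {
		return false
	}
	connAt, err := time.Parse(timeFormat, connTime)
	// uint64 arithmetic mirrors the diff and avoids signed overflow when
	// maxConnectionAge is set to an effectively infinite duration.
	return err == nil && uint64(time.Since(connAt)) > uint64(c.maxConnectionAge)+uint64(c.maxConnectionAge/2)
}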
This approach would probably work. I think we would also need the merge patch, or to add a retry on errors.IsInvalid. If we delete the annotation on line 290 after we Get the entry for the patch, the patch will be confused about whether to use op: add or op: replace.
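To illustrate the point (the annotation key and timestamp below are assumptions, not copied from the PR): an RFC 6902 JSON patch built from a stale Get can emit a replace for a path that a concurrent writer already removed, which the API server rejects, while a JSON merge patch simply sets the key either way.

package patchsketch

// Hypothetical illustration only; the annotation key and timestamp are
// assumed values, not taken from this PR.

// With an RFC 6902 JSON patch, "replace" is rejected when the target path no
// longer exists, so a patch built from a stale snapshot can fail with
// errors.IsInvalid if another writer removed the annotation first.
var jsonPatch = []byte(`[
  {"op": "replace",
   "path": "/metadata/annotations/istio.io~1disconnectedAt",
   "value": "2020-12-18T00:00:00Z"}
]`)

// A JSON merge patch (RFC 7386) sets the key whether or not it already
// exists, so it avoids the add-vs-replace ambiguity at the cost of
// last-write-wins semantics on the fields it touches.
var mergePatch = []byte(`{
  "metadata": {
    "annotations": {"istio.io/disconnectedAt": "2020-12-18T00:00:00Z"}
  }
}`)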
Without merge patch, the JSON patch can return an error, and then the stream will be recreated; we depend on the client-side retry.
If merge patch is applied, may it mask potential races? I am not sure how much effort would be needed to handle that. This is the main concern I have.
If we do retries, we probably want to leave as many annotations set as possible so that we can determine ordering. With the retry approach, we have a true "last-write-wins" on the entire object. Maybe this is easier.
Using merge means we avoid retries, which seems easier to reason about when we have this concurrent access. That approach is optimistic for connections but discards disconnections more easily. There are a few more changes needed to make it work, included in my PR.
If we're going with the retry approach, I don't see a reason to use patch at all. If anything that would mask races, since it may or may not succeed even with a concurrent modification, even without merge patch. It's probably best to change this to Update for the purposes of this PR.
This PR is not concerned with whether we use patch, merge patch, or update. The issue it tries to resolve cannot be fixed purely by switching between patch, merge patch, and update. We can discuss that in a new issue.
If the rest of the approach relies on setting/unsetting annotations, you're likely to completely fail out on the Patch when not using merge patch. I'm fine with leaving the existing patch here, but adding retries on errors.IsInvalid.
If we're relying on retries anyway, it seems easier to just use Update + k8s.io/client-go/util/retry on errors.IsConflict. That way, we retry on all conflicts instead of some subset, making things easier to reason about and debug.
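A minimal sketch of that suggestion, assuming a typed WorkloadEntry client; the interface, annotation values, and timeFormat below are placeholders standing in for the real definitions:

package retrysketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"

	"istio.io/client-go/pkg/apis/networking/v1alpha3"
)

// Assumed stand-ins for the real constants used by the controller.
const (
	ConnectedAtAnnotation    = "istio.io/connectedAt"
	DisconnectedAtAnnotation = "istio.io/disconnectedAt"
	timeFormat               = time.RFC3339Nano
)

// wleClient is a narrow, hypothetical interface covering only the methods the
// sketch needs; the typed client generated in istio client-go is expected to
// have the same shape.
type wleClient interface {
	Get(ctx context.Context, name string, opts metav1.GetOptions) (*v1alpha3.WorkloadEntry, error)
	Update(ctx context.Context, wle *v1alpha3.WorkloadEntry, opts metav1.UpdateOptions) (*v1alpha3.WorkloadEntry, error)
}

// markConnected records the connect time with Update wrapped in
// retry.RetryOnConflict: every attempt re-reads the latest object, so a
// resourceVersion conflict from a concurrent writer simply triggers another
// Get+mutate+Update round instead of silently dropping anyone's changes.
func markConnected(ctx context.Context, c wleClient, name string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		wle, err := c.Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if wle.Annotations == nil {
			wle.Annotations = map[string]string{}
		}
		wle.Annotations[ConnectedAtAnnotation] = time.Now().Format(timeFormat)
		delete(wle.Annotations, DisconnectedAtAnnotation)
		_, err = c.Update(ctx, wle, metav1.UpdateOptions{})
		return err // errors.IsConflict(err) makes RetryOnConflict try again
	})
}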
Yeah, update sgtm.
/retest
* fix workloadentry lost when fast reconnect happen
* Handle fast reconnect: disconnect event later than reconnect event
* Cleanup when both workload/pilot down at the same time before pilot has a chance to set disconnTime
* fix ut
* address comment
* Consistently lock and copy ads clients (#28968)
* Consistently lock and copy ads clients fixes #28958
* fix wrong reference
* optimize
* fix race
* Simplify internal generator and refactor a wle controller (#29554)
* Refactor internal generator and workloadentry controller
* Simplify internal gen
* fix ci
* fix lint
* fix comment
* Handle wle register/unregister race (#29604)
* handle register/unregister wle race
* update
* update tmpl
* Fix ut
* fix lint
* Fix workloadentry lost when fast reconnect (#29685)
* fix workloadentry lost when fast reconnect happen
* Handle fast reconnect: disconnect event later than reconnect event
* Cleanup when both workload/pilot down at the same time before pilot has a chance to set disconnTime
* fix ut
* address comment
* Fix leaks in workload entry controller (#29793)
* vm auto registration: discard stale disconnect events (#29691)
* implement merge patch in crd controller
* make wle auto registration robust to register unregister race
* release note

Co-authored-by: John Howard <[email protected]>
Co-authored-by: Zhonghu Xu <[email protected]>
This can be reproduced using the pilot-load tool.