-
Notifications
You must be signed in to change notification settings - Fork 894
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(trafficrouting): Do not block the switch of service selectors for single pod failures #2441
Conversation
Signed-off-by: zachaller <[email protected]>
65cd406
to
6c21625
Compare
Signed-off-by: zachaller <[email protected]>
Codecov ReportBase: 81.58% // Head: 81.59% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## master #2441 +/- ##
==========================================
+ Coverage 81.58% 81.59% +0.01%
==========================================
Files 124 124
Lines 18959 18973 +14
==========================================
+ Hits 15467 15481 +14
Misses 2702 2702
Partials 790 790
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
Signed-off-by: zachaller <[email protected]>
Signed-off-by: zachaller <[email protected]>
Signed-off-by: zachaller <[email protected]>
Signed-off-by: zachaller <[email protected]>
Signed-off-by: zachaller <[email protected]>
Signed-off-by: zachaller <[email protected]>
Signed-off-by: zachaller <[email protected]>
Signed-off-by: zachaller <[email protected]>
Signed-off-by: zachaller <[email protected]>
Signed-off-by: zachaller <[email protected]>
|
||
if checkRsAvailability && !replicasetutil.IsReplicaSetPartiallyAvailable(rs) { | ||
logCtx.Infof("delaying service switch from %s to %s: ReplicaSet has zero availability", currSelector, desiredSelector) | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When service switch is delayed, should it return err, so that it will not move on to the next step reconcileTrafficRouting
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that was actually one of the first things that was tried and also does fix the issue is by introducing this error state however there are dependencies within rollouts that we still have to continue reconciling to keep progressing. There is quite a bit of background here #2187 as well as here #1777 where the intial return nil was introduced. It was introduced to prevent the adoption of a services causing downtime by making sure the service was fully available. The change in this PR keeps that behavior for the adoption case but switches to only needing some availability for the cases that we have already adopted the service.
Signed-off-by: zachaller <[email protected]>
Signed-off-by: zachaller <[email protected]>
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Thank you!
@leoluz what else remains to get this merged. Any chance of it being considered a more urgent bug for a second point release rather than having to wait all the way until 1.4? This has caused my organization two major production outages, and I'd like to not make it a third :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
@jessesuen are you a more appropriate party to ask how fast-tracked this can be? This is a really fatal bug and tbqh I'd like to see an emerg 1.3.x process for this but idk how to go about asking for that here. |
@jstewart612 I spoke with @zachaller about it and we can do a 1.3 patch release with this fix once we are ready. If you can't wait for the official patch release, as an alternative, you could do an "in-house" release for your organization. |
@leoluz : I appreciate the response. That's the aforementioned monkey-patching that I'd really like to avoid doing... but will if I must. Any idea of timetable on what was discussed on getting that done? |
@jstewart612 We are planning to following the timetable as close as possible for new releases such as 1.4 here as for the 1.3.2 patch release as soon as we hear back from the people at plaid that the fix does indeed seem solve the issue we can do a 1.3.2 release. |
@zachaller I'm not them, much to my chagrin, but can confirm this resolves the issue in my local dev env testing. |
@raxod502-plaid @jandersen-plaid are you the aforementioned Plaid people? |
@MarkSRobinson was testing this. Deployments take some time and the issue was never possible to reproduce on demand, so confirmation may take some time. |
@raxod502-plaid @MarkSRobinson considering argoproj official maintainers seem to be holding on your word, could you speed that up? |
If you would like to test this change you are welcome to -- usually this situation arises when there is a lack of nodes or scheduling issues, so you will just have to simulate nodes failing to come up or pods failing to schedule and hope that you hit this bug. In terms of Plaid "speeding things up": if you believe this change will solve the issue then you are welcome to fork the project and apply this PR as a patch or make whatever changes you would like to solve this particular issue (considering this is an open source project). If you want guarantees about timely fixes then you would have to talk to the argo team who may or may not be able to offer you a support contract. For Plaid specifically, we are working at our own pace given internal priorities and resourcing and offer no guarantees about when we will be finished testing the issue. |
@zachaller based on this #2441 (comment) it seems a bad idea to wait on resource-strapped reporters for QA who make no guarantees about their ability to follow up. What data do you require of me to validate the fix beyond my affirmation that it has done so in my environment? |
and @jandersen-plaid don't worry: I know what monkey-patching is and my org is already doing it as a result of having to wait here. I would just like to stop doing it because of all the caveats of running in that state. As you rightly note though, you are neither the one deciding to wait on your word nor presently possessed of the resources to do this timely (empathy on that latter point!). |
@jstewart612 I would be very curious on how you tested this locally because part of the reason I am waiting on their response is I was never able to test this locally this code deals with a very specific window of time at the end of a rollout that is quite hard if not impossible to replicate locally hence I am curious on how you tested because it could allow me to create a e2e test for this issue. It would also give me more confidence in knowing how you reproduced and tested it. I do not plan on waiting for their response in order for this change to make it into the 1.4 release because I am still pretty confident that the fixed behavior is correct. This issue should also really mostly affect @jandersen-plaid Sorry to hear about the layoffs I hope it does not affect you and others that have helped the rollouts community |
@zachaller any Rollout. dynamicStableScaling true. Change the image tag as to trigger a rollout. Happens to us every time that the Rollout entirely completes, but the rollouts-pod-template-hash on the stableService remains the old value indefinitely until we manually change it and I don't see anything in the logs indicating a failure or attempt to change this. |
@zachaller also I have tested dynamicStableScale false. Now, I don't have downtime, but it takes over 35 minutes to scale down 25 pods and tons of update errors about replicaset being modified cannot apply show up in my logs. This, I am going to assume, is the real problem. |
@jstewart612 Yea I was going to say there must be something going on in your cluster that is causing the canary to never go 100% healthy causing the outage. With the very slow scale down it sounds like something is also updating your replicasets preventing rollouts controller from updating the replica counts. |
@zachaller at the same time: you shouldn't be turning down, in the dynamicStableScale true case, the stable RS while under this condition, agreed (the action that causes the downtime)? |
@zachaller well turns out it's Lacework Admissions Controller and Argo Rollouts conflicting with each other. I'll separately bug-ticket them about that. dynamicStableScale true is still broken as previously claimed, even with this patch, but false does fine when someone else isn't jumping in and modifying objects out from under it :) |
What makes you say this patch does not fix the |
@zachaller The new canary RS was up with one or more pods. The Istio VirtualService was adjusted to weight one hundred percent to the canaryService (which was set as the final step of this particular Rollout). The stable RS then scaled down to zero. The Istio VirtualService then switched weights back to the stableService. The rollouts-pod-template-hash label selector on the stableService was still pointed at the old pod template hash it had just scaled to 0. Outage. |
@zachaller what is the risk of merging the code & releasing a patch? |
I plan on releasing a patch next week with or without an update from plaid, I am going to go through pr's and see if there are any other issues worth pulling in as well. |
… single pod failures (#2441) * fix(traficrouter): WIP on not setting weight if not available Signed-off-by: zachaller <[email protected]> * fix tests Signed-off-by: zachaller <[email protected]> * try bailing vs setting weight Signed-off-by: zachaller <[email protected]> * work with expirments that do not set any weights Signed-off-by: zachaller <[email protected]> * fix test by commenting out code Signed-off-by: zachaller <[email protected]> * lint Signed-off-by: zachaller <[email protected]> * simplify logic Signed-off-by: zachaller <[email protected]> * switch logic Signed-off-by: zachaller <[email protected]> * add more comments Signed-off-by: zachaller <[email protected]> * add more comments Signed-off-by: zachaller <[email protected]> * add more test Signed-off-by: zachaller <[email protected]> * refactor test Signed-off-by: zachaller <[email protected]> * refactor code to reduce duplication Signed-off-by: zachaller <[email protected]> * change comments a bit Signed-off-by: zachaller <[email protected]> * remove else Signed-off-by: zachaller <[email protected]> Signed-off-by: zachaller <[email protected]>
… single pod failures (argoproj#2441) * fix(traficrouter): WIP on not setting weight if not available Signed-off-by: zachaller <[email protected]> * fix tests Signed-off-by: zachaller <[email protected]> * try bailing vs setting weight Signed-off-by: zachaller <[email protected]> * work with expirments that do not set any weights Signed-off-by: zachaller <[email protected]> * fix test by commenting out code Signed-off-by: zachaller <[email protected]> * lint Signed-off-by: zachaller <[email protected]> * simplify logic Signed-off-by: zachaller <[email protected]> * switch logic Signed-off-by: zachaller <[email protected]> * add more comments Signed-off-by: zachaller <[email protected]> * add more comments Signed-off-by: zachaller <[email protected]> * add more test Signed-off-by: zachaller <[email protected]> * refactor test Signed-off-by: zachaller <[email protected]> * refactor code to reduce duplication Signed-off-by: zachaller <[email protected]> * change comments a bit Signed-off-by: zachaller <[email protected]> * remove else Signed-off-by: zachaller <[email protected]> Signed-off-by: zachaller <[email protected]> Signed-off-by: Travis Perdue <[email protected]>
… single pod failures (argoproj#2441) * fix(traficrouter): WIP on not setting weight if not available Signed-off-by: zachaller <[email protected]> * fix tests Signed-off-by: zachaller <[email protected]> * try bailing vs setting weight Signed-off-by: zachaller <[email protected]> * work with expirments that do not set any weights Signed-off-by: zachaller <[email protected]> * fix test by commenting out code Signed-off-by: zachaller <[email protected]> * lint Signed-off-by: zachaller <[email protected]> * simplify logic Signed-off-by: zachaller <[email protected]> * switch logic Signed-off-by: zachaller <[email protected]> * add more comments Signed-off-by: zachaller <[email protected]> * add more comments Signed-off-by: zachaller <[email protected]> * add more test Signed-off-by: zachaller <[email protected]> * refactor test Signed-off-by: zachaller <[email protected]> * refactor code to reduce duplication Signed-off-by: zachaller <[email protected]> * change comments a bit Signed-off-by: zachaller <[email protected]> * remove else Signed-off-by: zachaller <[email protected]> Signed-off-by: zachaller <[email protected]> Signed-off-by: Travis Perdue <[email protected]>
fixes: #2050
fixes: #2235