Argo rollouts often hangs with long running deploys #1193
Comments
Are you using scaleDownDelaySeconds? |
I'm not using |
My setup is using canary. |
Are you using canary + traffic routing? There is a fix in v1.0 plugin which may address this: 31afa28#diff-e6177f3a3015ec10cd7a2cfcb6293603125bbfb53bfdbd51f342507d084e6d2e You can download v1.0-rc1 plugin here: |
Yes, I'm using canary with traffic routing (Linkerd). The problem really isn't with the plugin; it's that Argo can't scale down old pods for a considerable time. |
Argo should set the If you see that the old ReplicaSet has FYI, we should emit Kubernetes events for these, which would aid in debugging. |
The ReplicaSet remains set to non-zero for a considerable amount of time. In fact, I see an error about being unable to update the ReplicaSet at the same time Argo moves to the "waiting for pods to terminate" state. I suspect that Argo is trying to update the RS, failing, and then not retrying. What events do you want me to extract from the event history? |
This is a good theory, especially if you notice the error in logs. Rollout has a default resync period of 15m, which is actually somewhat excessive and I think the default can be reduced to 5m. To test your theory, a workaround could be to decrease this to a lower value through the CLI option to the controller. e.g. 5 minutes:
The above setting will configure it such that if an error updating the ReplicaSet occurs, and for some reason we don't retry immediately (which is the expected behavior), then it will take at most 5m before it attempts it again. |
Would it cause problems to drop the resync period down to something like 60s? |
The lower the number, the more frequent the reconciliations. More frequent reconciliations mean:
So I would say it depends somewhat on the number of rollouts in the system. At the end of the day, 60s probably will not cause problems other than additional CPU usage, but I would keep an eye out for potentially more API calls. We have a Prometheus metric to track this if you need to measure. |
Let us know if reducing the resync period helped the problem. It will help pinpoint the cause of this bug (e.g. it implies we are not requeuing rollouts on errors properly) |
I actually think this is no longer a problem in v1.0. In v1.0, we introduced a scaleDownDelay for canary + trafficRouting: there is now a (configurable) default scaleDownDelay of 30s before the rollout scales down the old stack. The reason for leaving the old stack running for 30s is to give service meshes and ingress controllers a chance to adjust and propagate the traffic weight changes that the rollout made to the underlying network objects. Before scaleDownDelay, we scaled down the old stack immediately after promoting the canary, which could cause brief 500 errors if the mesh provider hadn't yet fully applied the weight changes when the pods of the old stack started shutting down. In other words, the whole process of scaling down the old ReplicaSet has changed in v1.0, so this bug is probably not applicable anymore. |
@jessesuen I've tried setting the rollout-resync to 60s and it hasn't improved the termination latency. I can investigate upgrading the argo-controller to 1.0-rc. Is there a timeline for a 1.0 release? |
That would be helpful! v1.0 is now released! |
Ok, I've tried with the new 1.0.1 release and I'm still seeing the behaviour described above. |
It turns out there is a bug in the use of the workqueue for rollouts. Something is adding the rollout object back to the queue dozens of times. This causes the exponential back-off queue to basically immediately hit the 16 minute limit (!). I added some logging in
Code I used to check
Something is calling |
I've got a mitigation for the issue - #1243 |
…queue. #1193 (#1243) This will prevent argo from hanging for up to 16 minutes at a time while processing a rollout. Signed-off-by: Mark Robinson <[email protected]>
Fixed in #1243 |
Summary
When deploying rollouts with a long step time (>10m), Argo Rollouts will often hang on the last step.
AR will get to 100% deployed, with all traffic on the new pods, but will get stuck tearing down the old pods. Using the kubectl plugin, the message
Message: old replicas are pending termination
will be displayed for a considerable amount of time (10-30 minutes). Eventually Argo will be able to terminate the pods. Ideally, Argo should be able to kill these pods in less time.
Diagnostics
Argo: 0.10.2
K8s: 1.17
There's a lot of log data
https://gist.github.com/MarkSRobinson/e44ce06689aa02dd2d7886482c152fe2