Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Argo rollouts often hangs with long running deploys #1193

Closed
MarkSRobinson opened this issue May 17, 2021 · 18 comments
Closed

Argo rollouts often hangs with long running deploys #1193

MarkSRobinson opened this issue May 17, 2021 · 18 comments
Labels
bug Something isn't working

Comments

@MarkSRobinson
Copy link
Contributor

Summary

When deploying roll outs with a long step time (>10m), Argo rollouts will often hang on the last step.

AR will get to 100% deployed, with all traffic on the new pods but will get stuck tearing down the old pods. Using kubectl plugin, the message Message: old replicas are pending termination will be displayed for a considerable amount of time (10-30 minutes). Eventually Argo will be able to terminate the pods.

Ideally Argo should be able to kill these pods in less time.

Diagnostics

Argo: 0.10.2
K8s: 1.17

There's a lot of log data
https://gist.github.com/MarkSRobinson/e44ce06689aa02dd2d7886482c152fe2


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@MarkSRobinson MarkSRobinson added the bug Something isn't working label May 17, 2021
@MarkSRobinson MarkSRobinson changed the title Argo rollouts often hangs will long running deploys Argo rollouts often hangs with long running deploys May 17, 2021
@jessesuen
Copy link
Member

Are you using scaleDownDelaySeconds?
Which version of the plugin? Is that also 0.10.2 as well?

@MarkSRobinson
Copy link
Contributor Author

I'm not using scaleDownDelaySeconds and the plugin version is v0.10.0+3837eab

@MarkSRobinson
Copy link
Contributor Author

My setup is using canary.

@jessesuen
Copy link
Member

Are you using canary + traffic routing? There is a fix in v1.0 plugin which may address this:

31afa28#diff-e6177f3a3015ec10cd7a2cfcb6293603125bbfb53bfdbd51f342507d084e6d2e

You can download v1.0-rc1 plugin here:
https://github.com/argoproj/argo-rollouts/releases/tag/v1.0.0-rc1

@MarkSRobinson
Copy link
Contributor Author

Yes, I'm using canary with traffic routing (linkerD).

The problem really isn't with the plugin, it's that Argo can't scale down old pods for a considerable time.

@jessesuen
Copy link
Member

Argo can't scale down old pods for a considerable time.

Argo should set the spec.replicas of the old ReplicaSet stack to 0 shortly after the promotion of the canary. After that step, it's really up to the Pods to shut themselves down since they will get SIGTERMed by kubelet while honoring the pod's grace period seconds before getting SIGKILLed by kubelet.

If you see that the old ReplicaSet has spec.replicas: 0, then Argo Rollouts has done its job and scaledown time is affected by other factors not controllable by Argo Rollouts. However, if spec.replicas of the old ReplicaSet is non-zero for a long time after the canary promotion, then there may be some type of issue.

FYI, we should emit Kubernetes events for these events so that will aide in the debugging.

@MarkSRobinson
Copy link
Contributor Author

The replica set remains set to non-zero for a considerable amount of time. In fact, I see an error about being unable to update the replicaset at the same time Argo moves to the "waiting for pods to terminate state". I suspect that Argo is trying to update the RS, failing, and then not retrying.

What events do you want me to extract from the event history?

@jessesuen
Copy link
Member

I suspect that Argo is trying to update the RS, failing, and then not retrying.

This is a good theory, especially if you notice the error in logs. Rollout has a default resync period of 15m, which is actually somewhat excessive and I think the default can be reduced to 5m. To test your theory, a workaround could be to decrease this to a lower value through the CLI option to the controller. e.g. 5 minutes:

rollouts-controller --rollout-resync 300

The above setting will configure it such that if an error updating the ReplicaSet occurs, and for some reason we don't retry immediately (which is the expected behavior), then it will take at most 5m before it attempts it again.

@MarkSRobinson
Copy link
Contributor Author

Would it cause problem to drop the resync down to like 60s?

@jessesuen
Copy link
Member

Would it cause problem to drop the resync down to like 60s?

The lower the number, the more frequently there are reconciliations. More frequent reconciliations means:

  • higher CPU for the rollout controller
  • highter memory requirements (if you increase workers)
  • potentially more API calls. though, we try to reduce this as much as possible using local cache (i.e. kubernetes informers)

So I would say it depends somewhat on the number of rollouts in the system.

At the end of the day, 60s probably will not cause problems other than additional CPU resources, but I would keep an eye out for potentially more API calls. We have a prometheus metric to track this if you need to measure.

@jessesuen
Copy link
Member

Let us know if reducing the resync period helped the problem. It will help pinpoint the cause of this bug (e.g. it implies we are not requeuing rollouts on errors properly)

@jessesuen
Copy link
Member

jessesuen commented May 18, 2021

I actually think this is no longer a problem in v1.0. In v1.0, we introduced a scaleDownDelay for canary+trafficRouting. In v1.0 there is now a (configurable) default scaleDownDelay of 30s before the rollout scales down the old stack.

The reason for leaving the old stack running for 30s, is to give a chance for service meshes & ingress controllers, to adjust/propagate the traffic weight changes which the rollout had made to the underlying network objects. Before scaleDownDelay, we were scaling down the old stack immediately after promoting the canary, which could cause brief 500 errors if the mesh provider hadn't yet fully made the weight changes and the rollout pods of the old stack started shutting down.

In other words, the the whole process of scaling down the old replicaset has changed in v1.0 and so this bug is probably not applicable anymore.

@jessesuen jessesuen added the more-information-needed More information is needed to make progress label May 18, 2021
@MarkSRobinson
Copy link
Contributor Author

@jessesuen I've tried setting the rollout-resync to 60s and it hasn't improved the termination latency.

I can investigate upgrading the argo-controller to 1.0-rc. Is there a timeline for a 1.0 release?

@no-response no-response bot removed the more-information-needed More information is needed to make progress label May 19, 2021
@jessesuen
Copy link
Member

I can investigate upgrading the argo-controller to 1.0-rc. Is there a timeline for a 1.0 release?

That would be helpful! v1.0 is now released!

@MarkSRobinson
Copy link
Contributor Author

Ok, I've tried with the new 1.0.1 release and I'm still seeing the behaviour described above.

@MarkSRobinson
Copy link
Contributor Author

It turns out there is a bug in the use of the workqueue for rollouts. Something is adding the rollout object back to the queue dozens of times. This causes the exponential back-off queue to basically immediately hit the 16 minute limit (!).

I added some logging in utils/controller/controller.go:processNextWorkItem() and got

time="2021-06-02T01:58:47Z" level=error msg="rollout syncHandler obj retries: 55 : key \"\"deveff-team/example\"\"" namespace=deveff-team rollout=example

  | time="2021-06-02T01:58:41Z" level=error msg="rollout syncHandler obj retries: 28 : key \"\"deveff-team/example\"\"" namespace=deveff-team rollout=example

  | time="2021-06-02T01:57:43Z" level=error msg="rollout syncHandler obj retries: 3 : key \"\"deveff-team/example\"\"" namespace=deveff-team rollout=example

Code I used to check

logCtx.Errorf("%s syncHandler obj retries: %v : key \"%#v\"", objType, workqueue.NumRequeues(key), key)

Something is calling workqueue.AddRateLimited(key) every other second. The result is that the back-off hits the ceiling almost instantly - and the limit is 16 minutes.

@MarkSRobinson
Copy link
Contributor Author

MarkSRobinson commented Jun 2, 2021

I've got a mitigation for the issue - #1243

jessesuen pushed a commit that referenced this issue Jun 9, 2021
…queue. #1193 (#1243)

This will prevent argo from hanging for up to 16 minutes at a time while processing a rollout.

Signed-off-by: Mark Robinson <[email protected]>
caoyang001 pushed a commit to caoyang001/argo-rollouts that referenced this issue Jun 12, 2021
…queue. argoproj#1193 (argoproj#1243)

This will prevent argo from hanging for up to 16 minutes at a time while processing a rollout.

Signed-off-by: Mark Robinson <[email protected]>
Signed-off-by: caoyang001 <[email protected]>
@MarkSRobinson
Copy link
Contributor Author

Fixed in #1243

huikang pushed a commit to huikang/argo-rollouts that referenced this issue Sep 16, 2021
…queue. argoproj#1193 (argoproj#1243)

This will prevent argo from hanging for up to 16 minutes at a time while processing a rollout.

Signed-off-by: Mark Robinson <[email protected]>
Signed-off-by: caoyang001 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants