Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aborting a fully-promoted traffic-routed rollout causes controller scale-down-deadline hot-loop #2982

Closed
jessesuen opened this issue Aug 22, 2023 · 2 comments · Fixed by #3064
Labels
bug Something isn't working workaround There's a workaround, might not be great, but exists

Comments

@jessesuen
Copy link
Member

jessesuen commented Aug 22, 2023

Describe the bug

A traffic routed rollout that is fully promoted (stableRS == currentPodHash), and then is aborted, will cause the controller to get into a hot loop where it continuously sets and unsets the scale-down-deadline on the desired ReplicaSet.

In one user's case, the controller would reconcile the same rollout up to 1000 times in a minute.

To Reproduce

  1. Deploy this rollout and allow it to reach a healthy state:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: istio-canary
spec:
  replicas: 3
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app: istio-canary
  template:
    metadata:
      labels:
        app: istio-canary
        sidecar.istio.io/inject: "true"
    spec:
      containers:
      - name: istio-canary
        image: argoproj/rollouts-demo:green
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        resources:
          requests:
            memory: 32Mi
            cpu: 5m
  strategy:
    canary:
      trafficRouting:
        istio:
          virtualService:
            name: istio-canary
          destinationRule:
            name: istio-canary
            canarySubsetName: canary
            stableSubsetName: stable
      steps:
      - setWeight: 5
      - pause: {}
  1. Abort the rollout, e.g. from CLI or Argo CD UI:
kubectl argo rollouts abort istio-canary

Expected behavior

Rollout should not get into a hot loop

Version

v1.4.1, v1.5.0

Logs

time="2023-08-22T19:49:19Z" level=info msg="Scale down new rs 'istio-canary-6c96c9c5d4' on abort (30s)" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Enqueueing parent of jesse-rollout-scaledown/istio-canary-6c96c9c5d4: Rollout jesse-rollout-scaledown/istio-canary"
time="2023-08-22T19:49:19Z" level=info msg="Set 'scale-down-deadline' annotation on 'istio-canary-6c96c9c5d4' to 2023-08-22T19:49:49Z (30s)" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="No Steps remain in the canary steps" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="No status changes. Skipping patch" generation=7 namespace=jesse-rollout-scaledown resourceVersion=66514968 rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Reconciliation completed" generation=7 namespace=jesse-rollout-scaledown resourceVersion=66514968 rollout=istio-canary time_ms=27.170599
time="2023-08-22T19:49:19Z" level=info msg="Started syncing rollout" generation=7 namespace=jesse-rollout-scaledown resourceVersion=66514968 rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Reconciling TrafficRouting with type 'Istio'" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Enqueueing parent of jesse-rollout-scaledown/istio-canary-6c96c9c5d4: Rollout jesse-rollout-scaledown/istio-canary"
time="2023-08-22T19:49:19Z" level=info msg="Removed 'scale-down-deadline' annotation from RS 'istio-canary-6c96c9c5d4'" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="No StableRS exists to reconcile or matches newRS" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Scale down new rs 'istio-canary-6c96c9c5d4' on abort (30s)" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="New rs 'istio-canary-6c96c9c5d4' has scaledown deadline annotation: 2023-08-22T19:49:49Z" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="RS 'istio-canary-6c96c9c5d4' has not reached the scaleDownTime" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="No Steps remain in the canary steps" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="No status changes. Skipping patch" generation=7 namespace=jesse-rollout-scaledown resourceVersion=66514968 rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Reconciliation completed" generation=7 namespace=jesse-rollout-scaledown resourceVersion=66514968 rollout=istio-canary time_ms=23.085656
time="2023-08-22T19:49:19Z" level=info msg="Started syncing rollout" generation=7 namespace=jesse-rollout-scaledown resourceVersion=66514968 rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Reconciling TrafficRouting with type 'Istio'" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: false, initialDeploy: false" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="No StableRS exists to reconcile or matches newRS" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Scale down new rs 'istio-canary-6c96c9c5d4' on abort (30s)" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Enqueueing parent of jesse-rollout-scaledown/istio-canary-6c96c9c5d4: Rollout jesse-rollout-scaledown/istio-canary"
time="2023-08-22T19:49:19Z" level=info msg="Set 'scale-down-deadline' annotation on 'istio-canary-6c96c9c5d4' to 2023-08-22T19:49:49Z (30s)" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="No Steps remain in the canary steps" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="No status changes. Skipping patch" generation=7 namespace=jesse-rollout-scaledown resourceVersion=66514968 rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Reconciliation completed" generation=7 namespace=jesse-rollout-scaledown resourceVersion=66514968 rollout=istio-canary time_ms=26.221208
time="2023-08-22T19:49:19Z" level=info msg="Started syncing rollout" generation=7 namespace=jesse-rollout-scaledown resourceVersion=66514968 rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Reconciling TrafficRouting with type 'Istio'" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: true, initialDeploy: false" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Enqueueing parent of jesse-rollout-scaledown/istio-canary-6c96c9c5d4: Rollout jesse-rollout-scaledown/istio-canary"
time="2023-08-22T19:49:19Z" level=info msg="Removed 'scale-down-deadline' annotation from RS 'istio-canary-6c96c9c5d4'" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="No StableRS exists to reconcile or matches newRS" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Scale down new rs 'istio-canary-6c96c9c5d4' on abort (30s)" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="New rs 'istio-canary-6c96c9c5d4' has scaledown deadline annotation: 2023-08-22T19:49:49Z" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="RS 'istio-canary-6c96c9c5d4' has not reached the scaleDownTime" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="No Steps remain in the canary steps" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="No status changes. Skipping patch" generation=7 namespace=jesse-rollout-scaledown resourceVersion=66514968 rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Reconciliation completed" generation=7 namespace=jesse-rollout-scaledown resourceVersion=66514968 rollout=istio-canary time_ms=24.025531
time="2023-08-22T19:49:19Z" level=info msg="Started syncing rollout" generation=7 namespace=jesse-rollout-scaledown resourceVersion=66514968 rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Reconciling TrafficRouting with type 'Istio'" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: false, initialDeploy: false" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="No StableRS exists to reconcile or matches newRS" namespace=jesse-rollout-scaledown rollout=istio-canary
time="2023-08-22T19:49:19Z" level=info msg="Scale down new rs 'istio-canary-6c96c9c5d4' on abort (30s)" namespace=jesse-rollout-scaledown rollout=istio-canary```

---
<!-- Issue Author: Don't delete this message to encourage other users to support your issue! -->
**Message from the maintainers**:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
@jessesuen jessesuen added the bug Something isn't working label Aug 22, 2023
@jessesuen
Copy link
Member Author

The workaround is to unset:

status:
  abort: true
  abortedAt: "2023-08-18T12:52:23Z"

I think the fix should be that:

  1. we do not attempt to abort the rollout (set scale-down-deadline on the desired RS) if we are in a fully promoted state.
  2. we should also immediately clear status.abort and status.abortedAt if we are fully promoted

Some protection logic must already exist to remove scale-down-deadline on the desired RS, so the good news is that this does not cause an outage.

@jessesuen
Copy link
Member Author

NOTE: Argo CD Rollout abort action already has some prevention of making the Abort action greyed out when the rollout is fully promoted:

https://github.com/argoproj/argo-cd/blob/master/resource_customizations/argoproj.io/Rollout/actions/discovery.lua#L12-L14

But it's not fool proof. It also doesn't prevent CLI from setting status.abort.

@jessesuen jessesuen added the workaround There's a workaround, might not be great, but exists label Aug 22, 2023
@jessesuen jessesuen changed the title Aborting a fully-promoted rollout with traffic routing causes controller hot-loop Aborting a fully-promoted traffic-routed rollout causes controller scale-down-deadline hot-loop Aug 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working workaround There's a workaround, might not be great, but exists
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant