Rollout stuck in Progressing without aborting when (Cluster)AnalysisTemplate is missing required arg #872

Closed
davidxia opened this issue Dec 2, 2020 · 1 comment · Fixed by #1094 or #1117
Comments

davidxia commented Dec 2, 2020

Summary

If a Rollout doesn't provide a required arg for a (Cluster)AnalysisTemplate, the Rollout remains stuck indefinitely on the same step index with a status.conditions[0].type of Progressing. I would expect the Rollout to transition to an error state so that the canary fails and is rolled back.
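
Condensed from the full resources below, the mismatch looks like this: the template declares args with no default value (which makes them required), while the Rollout's analysis step references the template without passing any args.

apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: apollo-default
spec:
  args:
  - name: container   # no value/default, so the Rollout must supply it
  - name: role        # required for the same reason
  - name: namespace   # required for the same reason
  # args with defaults and the metrics section omitted for brevity
---
# The Rollout's canary analysis step references the template but passes no
# args, so the controller fails with "args.container was not resolved".
- analysis:
    templates:
    - templateName: apollo-default
      clusterScope: true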

Diagnostics

Argo Rollouts version: 0.10.0

time="2020-12-02T10:04:48Z" level=info msg="Started syncing rollout at (2020-12-02 10:04:48.023157319 +0000 UTC m=+53682.124478484)" namespace=warpspeed rollout=dxia-test
time="2020-12-02T10:04:48Z" level=info msg="Reconciling analysis step (stepIndex: 2)" namespace=warpspeed rollout=dxia-test
time="2020-12-02T10:04:48Z" level=info msg="Reconciliation completed" namespace=warpspeed rollout=dxia-test time_ms=3.0881670000000003
time="2020-12-02T10:04:48Z" level=error msg="rollout syncHandler error: args.container was not resolved" namespace=warpspeed rollout=dxia-test
E1202 10:04:48.026334       1 controller.go:172] args.container was not resolved
time="2020-12-02T10:21:28Z" level=info msg="Started syncing rollout at (2020-12-02 10:21:28.026663647 +0000 UTC m=+54682.127984826)" namespace=warpspeed rollout=dxia-test
time="2020-12-02T10:21:28Z" level=info msg="Reconciling analysis step (stepIndex: 2)" namespace=warpspeed rollout=dxia-test
time="2020-12-02T10:21:28Z" level=info msg="Reconciliation completed" namespace=warpspeed rollout=dxia-test time_ms=3.051261
time="2020-12-02T10:21:28Z" level=error msg="rollout syncHandler error: args.container was not resolved" namespace=warpspeed rollout=dxia-test
E1202 10:21:28.029832       1 controller.go:172] args.container was not resolved

[repeat forever]

My Rollout

kubectl -n warpspeed describe rollouts dxia-test

Name:         dxia-test
Namespace:    warpspeed
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"argoproj.io/v1alpha1","kind":"Rollout","metadata":{"annotations":{},"name":"dxia-test","namespace":"warpspeed"},"spec":{"mi...
              rollout.argoproj.io/revision: 2
API Version:  argoproj.io/v1alpha1
Kind:         Rollout
Metadata:
  Creation Timestamp:  2020-12-01T20:20:24Z
  Generation:          2
  Resource Version:    55979275
  Self Link:           /apis/argoproj.io/v1alpha1/namespaces/warpspeed/rollouts/dxia-test
  UID:                 4f7e3931-fd11-4b68-9c21-79ae317a6c7b
Spec:
  Min Ready Seconds:       30
  Replicas:                5
  Revision History Limit:  3
  Selector:
    Match Labels:
      App:  nginx
  Strategy:
    Canary:
      Canary Metadata:
        Labels:
          Stage:  canary
      Stable Metadata:
        Labels:
          Stage:  stable
      Steps:
        Set Weight:  20
        Pause:
          Duration:  1m
        Analysis:
          Templates:
            Cluster Scope:  true
            Template Name:  apollo-default
        Set Weight:         60
        Pause:
          Duration:  1m
        Analysis:
          Templates:
            Cluster Scope:  true
            Template Name:  apollo-default
        Set Weight:         100
        Pause:
          Duration:  1m
        Analysis:
          Templates:
            Cluster Scope:  true
            Template Name:  apollo-default
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        App:  nginx
    Spec:
      Containers:
        Image:  nginx:1.18.0
        Name:   nginx
        Ports:
          Container Port:  80
        Resources:
          Limits:
            Cpu:     800m
            Memory:  4G
          Requests:
            Cpu:     200m
            Memory:  1G
Status:
  HPA Replicas:        5
  Available Replicas:  5
  Blue Green:
  Canary:
  Conditions:
    Last Transition Time:  2020-12-01T20:21:32Z
    Last Update Time:      2020-12-01T20:21:32Z
    Message:               Rollout has minimum availability
    Reason:                AvailableReason
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-12-01T20:23:19Z
    Last Update Time:      2020-12-01T20:24:19Z
    Message:               Rollout is resumed
    Reason:                RolloutResumed
    Status:                Unknown
    Type:                  Progressing
  Current Pod Hash:        5687494758
  Current Step Hash:       75c9b5b4d5
  Current Step Index:      2
  Observed Generation:     2
  Ready Replicas:          5
  Replicas:                5
  Selector:                app=nginx
  Stable RS:               557b7ffd5
  Updated Replicas:        1
Events:                    <none>

My ClusterAnalysisTemplate

kubectl --context gke_gke-xpn-1_europe-west1_testing-europe-west1 get clusteranalysistemplates apollo-default -o yaml
apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  creationTimestamp: "2020-09-29T14:54:55Z"
  generation: 2
  name: apollo-default
  resourceVersion: "55936984"
  selfLink: /apis/argoproj.io/v1alpha1/clusteranalysistemplates/apollo-default
  uid: 5adaa4f7-7674-416a-9d56-350e48144283
spec:
  args:
  - name: container
  - name: role
  - name: namespace
  - name: lookback-time
    value: "600"
  - name: http-error-ratio-threshold
    value: "0.01"
  - name: http-max-latency-threshold
    value: "100000000"
  - name: cpu-util-threshold
    value: "0.8"
  - name: memory-util-threshold
    value: "0.8"
  metrics:
  - name: http-error-ratio
    provider:
      job:
        spec:
          backoffLimit: 1
          template:
            spec:
              containers:
              - args:
                - judge
                - --key apollo
                - --filter platform=gke
                - --filter env=production
                - --filter role={{ args.role }}
                - --filter what=endpoint-request-rate
                - --filter "status-code^5"
                - --filter stat=1m
                - --filter "unit=request/s"
                - --lookback-time {{ args.lookback-time }}
                - --aggregation group,podname,max
                - --judgment-threshold '<={{ args.http-error-ratio-threshold }}'
                - --missing-metrics-allowed
                image: gcr.io/action-containers/canary-judger-heroic:1.0
                name: canary-judger-heroic
                resources:
                  limits:
                    cpu: 100m
                    memory: 200M
                  requests:
                    cpu: 100m
                    memory: 200M
              restartPolicy: Never
  - name: http-max-latency
    provider:
      job:
        spec:
          backoffLimit: 1
          template:
            spec:
              containers:
              - args:
                - judge
                - --key apollo
                - --filter platform=gke
                - --filter env=production
                - --filter role={{ args.role }}
                - --filter what=endpoint-request-duration
                - --filter stat=p99
                - --lookback-time {{ args.lookback-time }}
                - --aggregation group,endpoint,max
                - --judgment-threshold '<={{ args.http-max-latency-threshold }}'
                - --missing-metrics-allowed
                image: gcr.io/action-containers/canary-judger-heroic:1.0
                name: canary-judger-heroic
                resources:
                  limits:
                    cpu: 100m
                    memory: 200M
                  requests:
                    cpu: 100m
                    memory: 200M
              restartPolicy: Never
  - name: container-restarts
    provider:
      job:
        spec:
          backoffLimit: 1
          template:
            spec:
              containers:
              - args:
                - judge
                - --key kube-state-metrics
                - --filter container={{ args.container }}
                - --filter namespace={{ args.namespace }}
                - --filter what=kube_pod_container_status_restarts_total
                - --lookback-time {{ args.lookback-time }}
                - --aggregation group,container,delta
                - --aggregation group,null,notNegative
                - --judgment-threshold '=0'
                image: gcr.io/action-containers/canary-judger-heroic:1.0
                name: canary-judger-heroic
                resources:
                  limits:
                    cpu: 100m
                    memory: 200M
                  requests:
                    cpu: 100m
                    memory: 200M
              restartPolicy: Never
  - name: cpu-util
    provider:
      job:
        spec:
          backoffLimit: 1
          template:
            spec:
              containers:
              - args:
                - judge-ratio
                - --n-key kube-resource-metrics
                - --n-filter env=production
                - --n-filter namespace={{ args.namespace }}
                - --n-filter container={{ args.container }}
                - --n-filter what=gke-container-cpu-usage
                - --n-filter unit=cores
                - --n-aggregation group,container,max
                - --d-key kube-state-metrics
                - --d-filter env=production
                - --d-filter namespace={{ args.namespace }}
                - --d-filter container={{ args.container }}
                - --d-filter what=kube_pod_container_resource_limits_cpu_cores
                - --lookback-time {{ args.lookback-time }}
                - --judgment-threshold '<{{ args.cpu-util-threshold }}'
                image: gcr.io/action-containers/canary-judger-heroic:1.0
                name: canary-judger-heroic
                resources:
                  limits:
                    cpu: 100m
                    memory: 200M
                  requests:
                    cpu: 100m
                    memory: 200M
              restartPolicy: Never
  - name: memory-util
    provider:
      job:
        spec:
          backoffLimit: 1
          template:
            spec:
              containers:
              - args:
                - judge-ratio
                - --n-key kube-resource-metrics
                - --n-filter env=production
                - --n-filter namespace={{ args.namespace }}
                - --n-filter container={{ args.container }}
                - --n-filter what=gke-container-mem-usage
                - --n-filter unit=bytes
                - --n-aggregation group,container,max
                - --d-key kube-state-metrics
                - --d-filter env=production
                - --d-filter namespace={{ args.namespace }}
                - --d-filter container={{ args.container }}
                - --d-filter what=kube_pod_container_resource_limits_memory_bytes
                - --lookback-time {{ args.lookback-time }}
                - --judgment-threshold '<{{ args.memory-util-threshold }}'
                image: gcr.io/action-containers/canary-judger-heroic:1.0
                name: canary-judger-heroic
                resources:
                  limits:
                    cpu: 100m
                    memory: 200M
                  requests:
                    cpu: 100m
                    memory: 200M
              restartPolicy: Never
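
A possible workaround sketch, assuming the canary analysis step accepts a step-level args list (the values here are illustrative for this nginx Rollout, not from the report): supply the template's required args from the Rollout itself so arg resolution succeeds while the missing-arg handling is fixed in the controller.

      steps:
      - setWeight: 20
      - pause:
          duration: 1m
      - analysis:
          templates:
          - templateName: apollo-default
            clusterScope: true
          # assumed step-level args; values are illustrative placeholders
          args:
          - name: container
            value: nginx
          - name: role
            value: dxia-test
          - name: namespace
            value: warpspeed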

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@davidxia davidxia added the bug Something isn't working label Dec 2, 2020
@jessesuen jessesuen added this to the v0.11 milestone Dec 2, 2020
@khhirani khhirani self-assigned this Feb 22, 2021

jessesuen commented Apr 27, 2021

The fix in #1094 caused a different regression and is being reverted in #1118, so reopening this issue.
