Migrating from Deployment to Canary causes downtime #1444

miguelvalerio · 2023-06-23T12:10:52Z

Describe the bug

Currently, we have the following (simplified) setup:

apiVersion: traefik.containo.us/v1alpha1
kind: TraefikService
metadata:
  name: test-app
  namespace: test
spec:
  weighted:
    services:
      - name: test-app
        namespace: test
        port: 5000
        weight: 100
        
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: test-app
  name: test-app
  namespace: test
spec:
  ports:
    - name: http
      port: 5000
      protocol: TCP
      targetPort: 5000
  selector:
    app.kubernetes.io/name: test-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: test-app
  name: test-app
  namespace: test
spec:
  progressDeadlineSeconds: 100
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: test-app
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: test-app
    spec:
      containers:
        - image: >-
            some-image
          name: test-app
          ports:
            - containerPort: 5000
              name: http
              protocol: TCP
         .......

During the initialization that Flagger performs when migrating from a standard Deployment to a Canary, there is brief downtime when we apply the following Canary:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: test-app
spec:
  provider: traefik
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-app
  progressDeadlineSeconds: 60
  service:
    port: 5000
    targetPort: 5000
  analysis:
    interval: 1m
    threshold: 10
    maxWeight: 50
    stepWeight: 20

This happens because the final part of the initialization is (from my understanding) done in this order:

1. Scale down the replicas of the test-app Deployment to 0
2. Update the test-app Service's selector labels to reference the primary pods
3. Update the TraefikService to point to the test-app-primary Service

As a result, between steps 1 and 2 there is a short time window in which a few 502 errors are returned, since the test-app Service has no pods to route to.

To Reproduce

Apply the configurations above while a load-testing tool such as https://github.com/tsenart/vegeta sends requests to the ingress, and observe the 502 errors being returned.
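
If vegeta is not at hand, a small stand-alone probe can serve the same purpose. The following is only a minimal sketch, not part of the original report: the request count, interval, timeout, and the helper names `probe` and `tally` are assumptions chosen for illustration, and the ingress URL has to be supplied as the first argument.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// tally counts how many probe results fall under each HTTP status code.
// A status of 0 stands for a request that failed before any response arrived.
func tally(statuses []int) map[int]int {
	counts := make(map[int]int)
	for _, s := range statuses {
		counts[s]++
	}
	return counts
}

// probe sends n GET requests to url, pausing for interval after each one,
// and returns the observed status codes in order.
func probe(client *http.Client, url string, n int, interval time.Duration) []int {
	statuses := make([]int, 0, n)
	for i := 0; i < n; i++ {
		resp, err := client.Get(url)
		if err != nil {
			statuses = append(statuses, 0)
		} else {
			statuses = append(statuses, resp.StatusCode)
			resp.Body.Close()
		}
		time.Sleep(interval)
	}
	return statuses
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: probe <ingress-url>")
		return
	}
	// Assumed values: 600 requests, 100ms apart, 2s timeout per request.
	client := &http.Client{Timeout: 2 * time.Second}
	counts := tally(probe(client, os.Args[1], 600, 100*time.Millisecond))
	fmt.Printf("502 responses: %d, all statuses: %v\n", counts[http.StatusBadGateway], counts)
}
```

Start the probe first, then apply the Canary while it is running; any non-zero count of 502 responses indicates the gap described above.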

Expected behavior

No 502s returned when migrating to Canary.

Additional context

Flagger version: v1.31.0
Kubernetes version: 1.25
Service Mesh provider: traefik

Updating the Service before scaling down the Deployment (swapping steps 1 and 2) seems like a good way to fix this issue. I wouldn't mind submitting a PR with this fix, but I'd like to confirm that this is the correct approach.
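
To illustrate why the order matters, here is a toy model of the two orderings. It is not Flagger's code: the `cluster` and `step` types and the `gaps` function are hypothetical, the TraefikService update is left out, and the model assumes the test-app-primary pods are already ready before either step runs.

```go
package main

import "fmt"

// cluster is a toy model of the relevant state. None of this is Flagger code.
type cluster struct {
	selector string         // workload the test-app Service currently selects
	ready    map[string]int // ready pods per workload
}

// step is one named mutation of the toy cluster.
type step struct {
	name  string
	apply func(*cluster)
}

// gaps applies the steps in order and returns the names of the steps after
// which the Service selects a workload with zero ready pods.
func gaps(steps []step) []string {
	c := &cluster{
		selector: "test-app",
		// Assumption: the primary pods are ready before these steps start.
		ready: map[string]int{"test-app": 2, "test-app-primary": 2},
	}
	var out []string
	for _, s := range steps {
		s.apply(c)
		if c.ready[c.selector] == 0 {
			out = append(out, s.name)
		}
	}
	return out
}

var scaleDown = step{
	name:  "scale test-app Deployment to 0",
	apply: func(c *cluster) { c.ready["test-app"] = 0 },
}

var switchSelector = step{
	name:  "point test-app Service at primary pods",
	apply: func(c *cluster) { c.selector = "test-app-primary" },
}

func main() {
	fmt.Println("reported order gaps:", gaps([]step{scaleDown, switchSelector}))
	fmt.Println("swapped order gaps: ", gaps([]step{switchSelector, scaleDown}))
}
```

In this model the reported order leaves the Service without ready pods after the scale-down step, while the swapped order never does, which matches the reasoning behind the proposed fix.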

aryan9600 · 2023-06-30T10:11:24Z

hello @miguelvalerio, thanks for opening this issue and volunteering to fix it! please take it on, your recommended fix should work. looking forward to your PR :)

miguelvalerio mentioned this issue Jul 2, 2023

Fix initial deployment downtime #1451

Merged

aryan9600 closed this as completed in #1451 Aug 18, 2023
