As a tekton pipeline user, I want to use liveness/readiness probes to check controller pod health #3111

xiujuan95 · 2020-08-18T09:45:37Z

Feature request

I want to use liveness and readiness probes to detect whether my Tekton controller pod is healthy. However, it seems that the liveness and readiness fields are not included in the controller deployment: https://github.com/tektoncd/pipeline/blob/master/config/controller.yaml

I have actually done some experiments on this. I configured liveness and readiness probes as below:

livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /metrics
    port: 9090
    scheme: HTTP
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /metrics
    port: 9090
    scheme: HTTP
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1

When I check the pod events, they show that the probes failed:

Events:
  Type     Reason     Age                From                  Message
  ----     ------     ----               ----                  -------
  Normal   Scheduled  35m                default-scheduler     Successfully assigned tekton-pipelines/tekton-pipelines-controller-7cd74569b7-mm96v to 10.242.0.19
  Normal   Pulled     35m                kubelet, 10.242.0.19  Container image "icr.io/obs/codeengine/tekton-pipeline/controller-10a3e32792f33651396d02b6855a6e36:v0.14.2-rc2@sha256:845358c3bb68e6900b421545ee40352391baccb751ef63915758016c4745bdbe" already present on machine
  Normal   Created    35m                kubelet, 10.242.0.19  Created container tekton-pipelines-controller
  Normal   Started    35m                kubelet, 10.242.0.19  Started container tekton-pipelines-controller
  Warning  Unhealthy  35m (x2 over 35m)  kubelet, 10.242.0.19  Readiness probe failed: Get http://172.30.18.233:9090/metrics: dial tcp 172.30.18.233:9090: connect: connection refused
  Warning  Unhealthy  35m (x2 over 35m)  kubelet, 10.242.0.19  Liveness probe failed: Get http://172.30.18.233:9090/metrics: dial tcp 172.30.18.233:9090: connect: connection refused

However, my pod is still running normally:

kubectl get pod -n tekton-pipelines -o wide
NAME                                           READY   STATUS    RESTARTS   AGE   IP               NODE            NOMINATED NODE   READINESS GATES
tekton-pipelines-controller-7cd74569b7-mm96v   1/1     Running   0          52m   172.30.18.233    10.242.0.19     <none>           <none>

This is not expected.

BTW, I can run curl http://localhost:9090/metrics successfully inside the Tekton controller container:

kubectl exec -ti tekton-pipelines-controller-7cd74569b7-mm96v -n tekton-pipelines sh
sh-4.4# curl http://localhost:9090/metrics
# HELP tekton_client_latency How long Kubernetes API requests take
# TYPE tekton_client_latency histogram
tekton_client_latency_bucket{name="",le="1e-05"} 13
tekton_client_latency_bucket{name="",le="0.0001"} 819
tekton_client_latency_bucket{name="",le="0.001"} 820
tekton_client_latency_bucket{name="",le="0.01"} 826
tekton_client_latency_bucket{name="",le="0.1"} 10358

Use case


xiujuan95 · 2020-08-19T07:09:24Z

Now, when I check the pod events, I find that the liveness/readiness probe failure messages are gone:

kubectl describe pod tekton-pipelines-controller-7cd74569b7-mm96v -n tekton-pipelines
Name:         tekton-pipelines-controller-7cd74569b7-mm96v
Namespace:    tekton-pipelines
Priority:     0
Node:         10.242.0.19/10.242.0.19
Start Time:   Tue, 18 Aug 2020 04:49:50 -0400
Labels:       app=tekton-pipelines-controller
              app.kubernetes.io/component=controller
              app.kubernetes.io/instance=default
              app.kubernetes.io/name=controller
              app.kubernetes.io/part-of=tekton-pipelines
              app.kubernetes.io/version=devel
              pipeline.tekton.dev/release=devel
              pod-template-hash=7cd74569b7
              version=devel
Annotations:  container.apparmor.security.beta.kubernetes.io/tekton-pipelines-controller: runtime/default
              kubernetes.io/psp: ibm-coligo-restricted-psp
              prometheus.io/port: 9090
              prometheus.io/scrape: true
              seccomp.security.alpha.kubernetes.io/pod: docker/default
Status:       Running
IP:           172.30.18.233
IPs:
  IP:           172.30.18.233
Controlled By:  ReplicaSet/tekton-pipelines-controller-7cd74569b7
Containers:
  tekton-pipelines-controller:
    Container ID:  containerd://ac980aa330e0e2f0da70b013926e7528d278f0a41f27cc80c9a3ea02db051030
    Image:         icr.io/obs/codeengine/tekton-pipeline/controller-10a3e32792f33651396d02b6855a6e36:v0.14.2-rc2@sha256:845358c3bb68e6900b421545ee40352391baccb751ef63915758016c4745bdbe
    Image ID:      icr.io/obs/codeengine/tekton-pipeline/controller-10a3e32792f33651396d02b6855a6e36@sha256:845358c3bb68e6900b421545ee40352391baccb751ef63915758016c4745bdbe
    Port:          <none>
    Host Port:     <none>
    Args:
      -kubeconfig-writer-image
      icr.io/obs/codeengine/tekton-pipeline/kubeconfigwriter-3d37fea0b053ea82d66b7c0bae03dcb0:v0.14.2-rc2@sha256:d9163204ebd1f9b8d7bbafd888e9b2d661834dfda97d02002ef964b538fbc803
      -creds-image
      icr.io/obs/codeengine/tekton-pipeline/creds-init-c761f275af7b3d8bea9d50cc6cb0106f:v0.14.2-rc2@sha256:2d3fca0f61c115ba1c092d49fa328012f245d1a041467f4d34ee409b17537cfe
      -git-image
      icr.io/obs/codeengine/tekton-pipeline/git-init-4874978a9786b6625dd8b6ef2a21aa70:v0.14.2-rc2@sha256:aed72cf82ad06aedd4d185334cc4b2790e074626064ea1517e46429c7540a2eb
      -entrypoint-image
      icr.io/obs/codeengine/tekton-pipeline/entrypoint-bff0a22da108bc2f16c818c97641a296:v0.14.2-rc2@sha256:3bce35f04e04e74a539b7511bbd8db00bad4ffb8698aca65d1fb8e48db8e958a
      -imagedigest-exporter-image
      icr.io/obs/codeengine/tekton-pipeline/imagedigestexporter-6e7c518e6125f31761ebe0b96cc63971:v0.14.2-rc2@sha256:3174897711d6dc697834ebf8bf5ab79aaf1b68ab0922804999199f5fab08276c
      -pr-image
      icr.io/obs/codeengine/tekton-pipeline/pullrequest-init-4e60f6acf9725cba4c9b0c81d0ba89b8:v0.14.2-rc2@sha256:fc8589362d32095dd25fd4200174fc9b050b704d16c30159058ff89f8613ed2f
      -build-gcs-fetcher-image
      icr.io/obs/codeengine/tekton-pipeline/gcs-fetcher-029518c065a5d298216f115c6595f133:v0.14.2-rc2@sha256:ce5cf198fdc17ddd4c09666b34f4c7a9becd89fe6b97be2a99ae880c772f55af
      -affinity-assistant-image
      nginx@sha256:c870bf53de0357813af37b9500cb1c2ff9fb4c00120d5fe1d75c21591293c34d
      -nop-image
      tianon/true@sha256:009cce421096698832595ce039aa13fa44327d96beedb84282a69d3dbcf5a81b
      -gsutil-image
      google/cloud-sdk@sha256:37654ada9b7afbc32828b537030e85de672a9dd468ac5c92a36da1e203a98def
      -shell-image
      gcr.io/distroless/base@sha256:f79e093f9ba639c957ee857b1ad57ae5046c328998bf8f72b30081db4d8edbe4
    State:          Running
      Started:      Tue, 18 Aug 2020 04:49:51 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  2Gi
    Requests:
      cpu:      500m
      memory:   512Mi
    Liveness:   http-get http://:9090/metrics delay=5s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:9090/metrics delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      SYSTEM_NAMESPACE:             tekton-pipelines (v1:metadata.namespace)
      CONFIG_LOGGING_NAME:          config-logging
      CONFIG_OBSERVABILITY_NAME:    config-observability
      CONFIG_ARTIFACT_BUCKET_NAME:  config-artifact-bucket
      CONFIG_ARTIFACT_PVC_NAME:     config-artifact-pvc
      CONFIG_FEATURE_FLAGS_NAME:    feature-flags
      CONFIG_LEADERELECTION_NAME:   config-leader-election
      METRICS_DOMAIN:               tekton.dev/pipeline
    Mounts:
      /etc/config-logging from config-logging (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from tekton-pipelines-controller-token-zls7s (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  config-logging:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      config-logging
    Optional:  false
  tekton-pipelines-controller-token-zls7s:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  tekton-pipelines-controller-token-zls7s
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 600s
                 node.kubernetes.io/unreachable:NoExecute for 600s
Events:          <none>

So I think the previous failure is expected: the pod was still starting up, so the liveness and readiness checks failed. Once the pod is ready, the liveness and readiness checks succeed.

But it would be better for you to add explicit probes for the controller. I think it's necessary.

imjasonh · 2020-08-19T10:53:10Z

@mattmoor is there a health check endpoint provided by knative/pkg by any chance? Perhaps hooked into any signal that reconciling is happening, and into the new HA stuff?

mattmoor · 2020-08-19T14:17:30Z

Yeah, but I think it's probably only exposed when the webhook is active, since the main use case would be lame ducking webhooks so the K8s Service drops the endpoint before it quits.

qu1queee · 2020-08-19T14:57:14Z

@imjasonh any comments around using the /metrics endpoint as the probe?

imjasonh · 2020-08-19T15:00:59Z

> @imjasonh any comments around using the /metrics endpoint as the probe?

It's honestly a bit surprising that it reports unhealthy in the example above. I'd want to look into that and figure out why that is. Maybe for simplicity we should add a new /health handler that simply always responds successfully, to remove potential noise.

That seems a bit less useful though, since ideally we'd like to only report "ready" when listers/informers are set up, or when the webhook is registered, and should probably have some way to programmatically be able to report unhealthy/non-live. That's why I roped in @mattmoor, in case these considerations have already come up and been solved in Knative-land.
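
The two ideas above (a shallow handler that always succeeds, and a readiness signal that only turns on after setup) could look roughly like the following. This is a minimal sketch, not Tekton or knative/pkg code: the `/health` and `/readiness` paths, the `ready` flag, and the `newProbeMux` and `probe` helpers are all hypothetical names, and where the flag would be flipped in a real controller (for example after informers sync) is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// ready is a hypothetical flag that a controller would set to true once
// its startup work (for example, informer cache sync) has finished.
var ready atomic.Bool

// newProbeMux registers the two illustrative probe paths.
func newProbeMux() *http.ServeMux {
	mux := http.NewServeMux()
	// Shallow liveness: answers 200 whenever the process can serve HTTP.
	mux.HandleFunc("/health", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: answers 503 until the ready flag has been set.
	mux.HandleFunc("/readiness", func(w http.ResponseWriter, _ *http.Request) {
		if ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	return mux
}

// probe sends an in-process GET to the handler and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newProbeMux()
	// Before startup completes: live, but not ready.
	fmt.Println(probe(mux, "/health"), probe(mux, "/readiness")) // prints "200 503"
	ready.Store(true)
	// After startup completes: ready as well.
	fmt.Println(probe(mux, "/readiness")) // prints "200"
}
```

The example exercises the handlers in-process instead of opening a port; a real deployment would serve the mux with `http.ListenAndServe` on whichever port the probes target.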

xiujuan95 · 2020-08-21T10:27:25Z

@imjasonh Thanks for your attention!

> > @imjasonh any comments around using the /metrics endpoint as the probe?

> It's honestly a bit surprising that it reports unhealthy in the example above. I'd want to look into that and figure out why that is. Maybe for simplicity we should add a new /health handler that simply always responds successfully, to remove potential noise.

I think it reports unhealthy because the pod is restarting. Once the pod is ready, the probes succeed. You can see here.
Yes, I agree with adding a simple /health endpoint for the liveness and readiness checks. Please consider it, thanks a lot!

zhangtbj · 2020-08-25T04:43:01Z

These liveness and readiness probes are also required for the Tekton Pipelines webhook and for the Tekton Triggers controller and webhook.

ywluogg · 2020-09-09T15:10:00Z

This sounds the same as the request in #1586. I tried adding a port to the controller (commit), but it still didn't work. I'm still trying to add this to the controller. The two probes have just been added to the webhook via this commit.

xiujuan95 · 2020-09-10T02:22:30Z

@ywluogg Why do you use port 8080 instead of 9090?

ywluogg · 2020-09-10T12:17:06Z

@xiujuan95 Ah that's because I wanted to separate the probes' ports from metrics port.

I was able to add the probes; the pending changes are in: 1d0f3d3

But as @imjasonh mentioned, it seems more useful if we can connect the probes to a signal that clearly tells whether the controller is processing reconciliations, which needs many more changes. The current controller setup is a single controller replica, and the time it takes to restart itself after a crash is roughly the same as the time it takes to be restarted via probes. Are you trying to use the probes for some other goals?

I'm going to wait for the discussion about this and then probably send a PR.

xiujuan95 · 2020-09-14T02:15:48Z

Thanks @ywluogg! No, I just want to use probes to detect the health of the controller; I don't have other goals.

ywluogg · 2020-09-14T03:26:38Z

@imjasonh do you think we should add the probes for simple health check purposes for now?

qu1queee · 2020-09-17T18:29:32Z

Any updates on this issue? It seems that if we use the /metrics endpoint for probes, it will eventually conflict with HA for the controllers, as explained in #2735 (comment).

afrittoli · 2020-10-05T12:46:00Z

The probes are available now on the webhook:

pipeline/config/webhook.yaml

Lines 93 to 103 in cfe2fe0

    
        livenessProbe:
          tcpSocket:
            port: https-webhook
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 5
        readinessProbe:
          tcpSocket:
            port: https-webhook
          initialDelaySeconds: 5
          periodSeconds: 10

afrittoli · 2020-10-05T13:03:03Z

> Yeah, but I think it's probably only exposed when the webhook is active, since the main use case would be lame ducking webhooks so the K8s Service drops the endpoint before it quits.

@mattmoor @pmorie I found knative/pkg#1048 but it's not clear to me then whether it is available yet or not. If not we could perhaps add a "shallow" /healthz for now, that always reports "ok" (as @imjasonh suggested) and switch to the knative/pkg one once it becomes available.

ywluogg · 2020-10-06T19:21:14Z

Agreed with @afrittoli. It still seems more suitable to add the health checks after knative/pkg#1048 becomes available.

xiujuan95 · 2020-11-03T06:57:48Z

Hi @ywluogg, any updates on adding liveness/readiness probes for the controller?

xiujuan95 added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 18, 2020

xiujuan95 mentioned this issue Sep 14, 2020

Support controller leader election in tekton-pipeline #2735

Closed

ywluogg mentioned this issue Oct 16, 2020

Switch webhook liveness/readiness probes to use http ports #3349

Merged

4 tasks

ywluogg mentioned this issue Nov 3, 2020

Add readiness and liveness probes in controller #3489

Merged

4 tasks

tekton-robot closed this as completed in #3489 Nov 4, 2020
