Short service interruption #516

moabu · 2023-02-09T08:34:49Z

Scenario:

We run Gluu on EKS behind an ingress-nginx controller. The oxAuth and SCIM Deployments each run with a minimum of 2 replicas.

We reliably observe short service interruptions in all our stages. They occur every time a pod is terminated, e.g. during helm upgrades, Deployment restarts, or node drains.

This service interruption is reproducible, e.g. with:

1. Run Apache Benchmark: ab -c 2 -n 10000 -k "https://<idp_host>/.well-known/openid-configuration"
2. Restart the oxauth Deployment: kubectl rollout restart deployment gluu-oxauth

As expected, the Deployment creates a new ReplicaSet and terminates an old pod each time a new pod becomes ready. However, every time a pod terminates, some 503 errors are returned to the client, and ingress-nginx logs upstream connect and timeout errors.
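
The two steps can be combined into one script so that the load is running while the rollout is in progress. This is only a sketch: the function names non_2xx_count and reproduce are made up for illustration, the host and Deployment name are placeholders, and it assumes ab and kubectl are on the PATH and that ab prints a "Non-2xx responses:" line only when such responses occurred.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Gluu or ingress-nginx.

# Read an ab report on stdin and print the "Non-2xx responses" count.
# ab omits that line when every response was 2xx, so the default is 0.
non_2xx_count() {
  awk -F: '/^Non-2xx responses/ { gsub(/[ \t]/, "", $2); n = $2 } END { print n + 0 }'
}

# Run ab in the background, restart the Deployment, wait for both to finish,
# then print how many responses were not 2xx.
reproduce() {
  host=$1
  deploy=$2
  report=$(mktemp)
  ab -c 2 -n 10000 -k "https://${host}/.well-known/openid-configuration" > "$report" 2>&1 &
  ab_pid=$!
  kubectl rollout restart deployment "$deploy"
  kubectl rollout status deployment "$deploy"
  wait "$ab_pid"
  echo "non-2xx responses: $(non_2xx_count < "$report")"
  rm -f "$report"
}

# Usage (not run automatically):
#   reproduce "<idp_host>" gluu-oxauth
```

If ab finishes before the first old pod is terminated, the request count (-n) probably needs to be raised for the run to overlap with the rollout.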

The errors are visible in the ingress-nginx logs as well as in the ab output. Here is an excerpt from the ingress-nginx logs of our test setup:

192.168.183.234 - - [08/Feb/2023:11:00:21 +0000] "GET /.well-known/openid-configuration HTTP/1.0" 503 421 "-" "ApacheBench/2.3" 156 0.005 [test-gluu-oxauth-8080] [] 10.225.124.130:8080 421 0.008 503 b8928c63e04f2148827e2664219a3540
2023/02/08 11:00:21 [error] 35#35: *1704589 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.183.234, server: id-test.k8stest.aws.gluu.test, request: "GET /.well-known/openid-configuration HTTP/1.0", upstream: "http://10.225.124.130:8080/oxauth/.well-known/openid-configuration", host: "id-test.k8stest.aws.gluu.test"
192.168.183.234 - - [08/Feb/2023:11:00:21 +0000] "GET /.well-known/openid-configuration HTTP/1.0" 200 9603 "-" "ApacheBench/2.3" 156 0.003 [test-gluu-oxauth-8080] [] 10.225.124.130:8080, 10.225.124.97:8080 0, 9603 0.000, 0.004 502, 200 eec32abb82fd227636cd02b3abe4a1ce
2023/02/08 11:00:21 [error] 35#35: *1704589 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.183.234, server: id-test.k8stest.aws.gluu.test, request: "GET /.well-known/openid-configuration HTTP/1.0", upstream: "http://10.225.124.130:8080/oxauth/.well-known/openid-configuration", host: "id-test.k8stest.aws.gluu.test"
192.168.183.234 - - [08/Feb/2023:11:00:21 +0000] "GET /.well-known/openid-configuration HTTP/1.0" 200 9603 "-" "ApacheBench/2.3" 156 0.002 [test-gluu-oxauth-8080] [] 10.225.124.130:8080, 10.225.124.97:8080 0, 9603 0.000, 0.000 502, 200 5cc5d64b00b4536d8a96e3e6d964ce95
2023/02/08 11:00:26 [error] 35#35: *1704589 upstream timed out (110: Operation timed out) while connecting to upstream, client: 192.168.183.234, server: id-test.k8stest.aws.gluu.test, request: "GET /.well-known/openid-configuration HTTP/1.0", upstream: "http://10.225.124.130:8080/oxauth/.well-known/openid-configuration", host: "id-test.k8stest.aws.gluu.test"
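
To get a quick count of what clients actually received, the access-log lines can be tallied by status code. A rough sketch, assuming the access-log layout shown in the excerpt (status code is the first field after the quoted request) and a made-up function name status_histogram; error-log lines are skipped:

```shell
#!/bin/sh
# Hypothetical helper: count client-facing HTTP status codes in
# ingress-nginx access-log lines read from stdin.
status_histogram() {
  awk -F'"' '
    / \[error\] / { next }                      # skip nginx error-log lines
    NF >= 3 { split($3, f, " "); count[f[1]]++ } # $3 is " <status> <bytes> "
    END { for (s in count) print s, count[s] }
  ' | sort
}

# Usage (not run automatically), with a placeholder pod name:
#   kubectl logs <ingress-nginx-controller-pod> | status_histogram
```

In the excerpt, the lines that list two upstream addresses with "502, 200" are requests that nginx retried against a second pod after the first connect failed, so the client still got a 200. The 503 line lists only the terminating pod.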

As far as I understand, these errors are due to the way k8s handles a pod's lifecycle:

- The pod is updated in the k8s API (marked as "Terminating" by setting "deletionTimestamp").
- The control plane removes the pod's address from the Endpoints/EndpointSlice objects.
- At the same time, the kubelet sends SIGTERM to the pod's containers.

To me, this looks like a race condition: ingress-nginx watches the endpoint objects and removes the pod's address from its upstream load balancing, but not fast enough to prevent requests from reaching the pod while or after it receives SIGTERM.
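
The sequence can be observed from a second terminal while a pod is being terminated. A small sketch using plain kubectl; the function names are made up, and the pod and Service names passed in are placeholders for whatever your release uses:

```shell
#!/bin/sh
# Hypothetical helpers for watching the termination sequence.

# Print a pod's deletionTimestamp (empty while the pod is not terminating).
deletion_timestamp() {
  kubectl get pod "$1" -o jsonpath='{.metadata.deletionTimestamp}'
}

# Stream changes to the EndpointSlices that back a Service.
# EndpointSlices carry the kubernetes.io/service-name label.
watch_endpointslices() {
  kubectl get endpointslices -l "kubernetes.io/service-name=$1" --watch
}

# Usage (not run automatically):
#   watch_endpointslices gluu-oxauth
#   deletion_timestamp <oxauth_pod_name>
```

This only shows when the API objects change. It does not show when ingress-nginx has actually applied the change to its upstreams, which is the part that lags.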

As a possible solution, I found that after adding a short sleep (5 seconds) to the pod's preStop lifecycle hook, no errors occur, possibly because ingress-nginx then has enough time to remove the terminating pod from its upstream pool.

With the preStop hook, our oxauth Deployment manifest looks like this (excerpt):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-gluu-oxauth
spec:
  template:
    spec:
      containers:
      - name: oxauth
        lifecycle:
          preStop:
            exec:
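              # Runs before SIGTERM is sent; the delay gives ingress-nginx
              # time to drop this pod from its upstreams.
              command:
              - /bin/sh
              - -c
              - /bin/sleep 5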
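
For trying the hook on a running Deployment without editing the chart, the same excerpt can be applied as a strategic merge patch. This is a sketch, not how the chart does it: add_prestop_sleep is a made-up name, and the Deployment and container names are taken from the excerpt above. The preStop time counts against the pod's terminationGracePeriodSeconds, so the sleep has to stay well below that value.

```shell
#!/bin/sh
# Hypothetical helper: add a preStop sleep to one container of a Deployment.
# Strategic merge matches the container by its "name" key, so other
# containers and fields are left untouched.
add_prestop_sleep() {
  deploy=$1
  container=$2
  seconds=$3
  kubectl patch deployment "$deploy" --type strategic -p \
    "{\"spec\":{\"template\":{\"spec\":{\"containers\":[{\"name\":\"$container\",\"lifecycle\":{\"preStop\":{\"exec\":{\"command\":[\"/bin/sh\",\"-c\",\"/bin/sleep $seconds\"]}}}}]}}}}"
}

# Usage (not run automatically):
#   add_prestop_sleep test-gluu-oxauth oxauth 5
```

Patching the pod template triggers a new rollout, so the first rollout after the patch still terminates pods that lack the hook.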

moabu · 2023-02-09T09:51:45Z

76e67a9

moabu · 2023-02-09T10:01:30Z

6da0243
