Short service interruption #516
Closed
moabu started this conversation in Show and tell
Replies: 2 comments
Scenario:
We run Gluu on EKS behind an ingress-nginx controller. The oxAuth and SCIM Deployments each run with a minimum of 2 replicas.
Predictably, we observe short service interruptions in all our stages. This happens every time a pod gets terminated, e.g. due to helm upgrades, deployment restarts, or node draining operations.
This service interruption is reproducible, e.g. with:
ab -c 2 -n 10000 -k "https://<idp_host>/.well-known/openid-configuration"
kubectl rollout restart deployment gluu-oxauth
Errors show up in the ingress-nginx logs as well as in the ab output; here is an excerpt from the ingress-nginx logs of our test setup:
From my understanding, these errors are due to the way k8s handles a pod's lifecycle:
As I understand it, there is a race condition here: ingress-nginx watches Endpoints objects and removes the pod's address from its upstream load balancing, but not fast enough to prevent requests from reaching pods while or after they receive the SIGTERM signal.
As a possible solution, I found that when a short sleep (5 seconds) is added to the pod's preStop lifecycle hook, no errors occur, possibly because ingress-nginx then has enough time to remove the terminating pod from its upstream pool.
With the preStop hook, our oxAuth Deployment manifest looks like this (excerpt):
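The exact manifest is not reproduced here. As a rough sketch only, a preStop hook with a 5-second sleep could look like the fragment below. The container name is a placeholder, the grace period shown is simply the Kubernetes default, and the hook assumes the image ships /bin/sh and sleep; none of these are taken from our actual chart.

```yaml
# Illustrative excerpt, not a complete Deployment (selector, labels, image etc. omitted).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gluu-oxauth
spec:
  template:
    spec:
      # Should be longer than the preStop sleep plus the app's own shutdown time.
      # 30 is the Kubernetes default, shown here only for clarity.
      terminationGracePeriodSeconds: 30
      containers:
        - name: oxauth   # placeholder container name (assumption)
          lifecycle:
            preStop:
              exec:
                # The kubelet runs this hook before sending SIGTERM, so the pod keeps
                # serving for 5 more seconds while ingress-nginx updates its upstreams.
                command: ["/bin/sh", "-c", "sleep 5"]
```

The sleep does not remove the race itself; it only widens the window in which the terminating pod still answers requests, so the right duration depends on how quickly the ingress controller picks up Endpoints changes in a given cluster.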