Anti-affinity seems not working with kubectl drain #152

Closed
sumit-bm opened this issue Apr 1, 2019 · 3 comments
Labels: kind/bug (Something isn't working), status/invalid (This doesn't seem right)

Comments

sumit-bm commented Apr 1, 2019

While trying fault injection and scheduled maintenance scenarios under moderate IO (transactions with a single reader and writer, plus a benchmark with 5 streams: 10 readers, 10 writers, 1000-byte events at 100 events/sec, 10 segments each), I found that Pod anti-affinity does not seem to work effectively to prevent more than one instance of a particular component from being scheduled onto the same node. For example, in the following PKS cluster I ran kubectl drain to simulate scheduled maintenance of one of the cluster's nodes.

Environment details: PKS / Kubernetes, medium cluster:

3 master nodes @ large.cpu (4 CPU, 4 GB RAM, 16 GB disk)
5 worker nodes @ xlarge.cpu (8 CPU, 8 GB RAM, 32 GB disk)
Tier-1 storage is from a vSAN datastore
Tier-2 storage is carved out via the NFS client provisioner with Isilon as the backend

Before kubectl drain, the node and Pod distribution was as follows:

[root@manager1 ~]# kubectl get nodes
NAME                                   STATUS    ROLES     AGE       VERSION
20050ce9-a7a5-4d9c-9e2f-edd3e32a74ba   Ready     <none>    1d        v1.12.4
6d582fdd-f02b-4e3a-a767-98ce0ac82650   Ready     <none>    1d        v1.12.4
831d9e0d-559e-47a7-bbd5-77eeff9d8c58   Ready     <none>    1d        v1.12.4
e4f2afcd-2201-44d5-a893-4509e35236b6   Ready     <none>    1d        v1.12.4
fb77cf66-5f5f-470b-b1ca-cac9247fe0bb   Ready     <none>    1d        v1.12.4
[root@manager1 ~]# kubectl describe node fb77cf66-5f5f-470b-b1ca-cac9247fe0bb
...
output truncated
...
Non-terminated Pods:         (7 in total)
  Namespace                  Name                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                          ------------  ----------  ---------------  -------------
  default                    pravega-bookie-0                              1 (12%)       2 (25%)     2Gi (26%)        4Gi (52%)
  default                    pravega-pravega-controller-c67d6b758-gp2ts    1 (12%)       2 (25%)     2Gi (26%)        3Gi (39%)
  kube-system                kubernetes-dashboard-5f4b59b97f-7p82j         50m (0%)      100m (1%)   100Mi (1%)       300Mi (3%)
  kube-system                monitoring-influxdb-cdcf4674-bz54z            0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pks-system                 event-controller-6c77ddd949-tkkx8             0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pks-system                 fluent-bit-c2s5x                              0 (0%)        0 (0%)      100Mi (1%)       100Mi (1%)
  pravega-longevity          small-tx-manager-687978df5d-np8ds             0 (0%)        0 (0%)      0 (0%)           0 (0%)
[root@manager1 ~]# kubectl describe node 6d582fdd-f02b-4e3a-a767-98ce0ac82650

Non-terminated Pods:         (8 in total)
  Namespace                  Name                                              CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                              ------------  ----------  ---------------  -------------
  default                    isilon-nfs-client-provisioner-67b7ffff86-4pdwf    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                    pravega-bookie-2                                  1 (12%)       2 (25%)     2Gi (26%)        4Gi (52%)
  default                    pravega-pravega-controller-c67d6b758-w7zsx        1 (12%)       2 (25%)     2Gi (26%)        3Gi (39%)
  kube-system                heapster-85647cf566-77q7h                         0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                tiller-deploy-95d654d46-9tg9r                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pks-system                 fluent-bit-vbkhz                                  0 (0%)        0 (0%)      100Mi (1%)       100Mi (1%)
  pks-system                 sink-controller-65595c498b-zmmxx                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pravega-longevity          small-tx-worker-76c77bb574-llmjd                  0 (0%)        0 (0%)      0 (0%)           0 (0%)

Initiated kubectl drain on one node:

[root@manager1 ~]# kubectl drain fb77cf66-5f5f-470b-b1ca-cac9247fe0bb --delete-local-data --ignore-daemonsets
node "fb77cf66-5f5f-470b-b1ca-cac9247fe0bb" already cordoned
WARNING: Deleting pods with local storage: kubernetes-dashboard-5f4b59b97f-7p82j, monitoring-influxdb-cdcf4674-bz54z, event-controller-6c77ddd949-tkkx8; Ignoring DaemonSet-managed pods: fluent-bit-c2s5x
pod "monitoring-influxdb-cdcf4674-bz54z" evicted
pod "kubernetes-dashboard-5f4b59b97f-7p82j" evicted
pod "event-controller-6c77ddd949-tkkx8" evicted
pod "pravega-pravega-controller-c67d6b758-gp2ts" evicted
pod "small-tx-manager-687978df5d-np8ds" evicted
pod "pravega-bookie-0" evicted
node "fb77cf66-5f5f-470b-b1ca-cac9247fe0bb" drained
[root@manager1 ~]# kubectl get nodes
NAME                                   STATUS                     ROLES     AGE       VERSION
20050ce9-a7a5-4d9c-9e2f-edd3e32a74ba   Ready                      <none>    1d        v1.12.4
6d582fdd-f02b-4e3a-a767-98ce0ac82650   Ready                      <none>    1d        v1.12.4
831d9e0d-559e-47a7-bbd5-77eeff9d8c58   Ready                      <none>    1d        v1.12.4
e4f2afcd-2201-44d5-a893-4509e35236b6   Ready                      <none>    1d        v1.12.4
fb77cf66-5f5f-470b-b1ca-cac9247fe0bb   Ready,SchedulingDisabled   <none>    1d        v1.12.4
[root@manager1 ~]#

Found that bookie-0, which was running on the drained node, is now running on another node where a bookie (bookie-2) is already running.

[root@manager1 ~]# kubectl describe node 6d582fdd-f02b-4e3a-a767-98ce0ac82650

Non-terminated Pods:         (12 in total)
  Namespace                  Name                                              CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                              ------------  ----------  ---------------  -------------
  default                    isilon-nfs-client-provisioner-67b7ffff86-4pdwf    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                    pravega-bookie-0                                  1 (12%)       2 (25%)     2Gi (26%)        4Gi (52%)
  default                    pravega-bookie-2                                  1 (12%)       2 (25%)     2Gi (26%)        4Gi (52%)
  default                    pravega-pravega-controller-c67d6b758-w7zsx        1 (12%)       2 (25%)     2Gi (26%)        3Gi (39%)
  kube-system                heapster-85647cf566-77q7h                         0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                monitoring-influxdb-cdcf4674-9sk8c                0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                tiller-deploy-95d654d46-9tg9r                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pks-system                 event-controller-6c77ddd949-54ddm                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pks-system                 fluent-bit-vbkhz                                  0 (0%)        0 (0%)      100Mi (1%)       100Mi (1%)
  pks-system                 sink-controller-65595c498b-zmmxx                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pravega-longevity          small-tx-manager-687978df5d-tvgjx                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pravega-longevity          small-tx-worker-76c77bb574-llmjd                  0 (0%)        0 (0%)      0 (0%)           0 (0%)

Per the anti-affinity rule, pod scheduling eligibility seems broken here: after draining one node, N-1 distinct nodes are still available, and on at least one of them no bookie is running. Snip below:

[root@manager1 ~]# kubectl describe node 20050ce9-a7a5-4d9c-9e2f-edd3e32a74ba

Non-terminated Pods:         (6 in total)
  Namespace                  Name                                CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                ------------  ----------  ---------------  -------------
  default                    pravega-benchmark                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                    pravega-pravega-segmentstore-0      2 (25%)       4 (50%)     4Gi (52%)        6Gi (78%)
  default                    pravega-zk-2                        500m (6%)     1 (12%)     1Gi (13%)        2Gi (26%)
  kube-system                metrics-server-555d98886f-dj5cz     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pks-system                 fluent-bit-mnxpb                    0 (0%)        0 (0%)      100Mi (1%)       100Mi (1%)
  pks-system                 telemetry-agent-559f9c8855-26srv    0 (0%)        0 (0%)      0 (0%)           0 (0%)

The bookie-0 pod should have been started on the *ba node, where no bookie was running, rather than on the *50 node where bookie-2 was already running.
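
Whether this rule is a hard guarantee or only a preference can be checked by dumping the rescheduled pod's spec, e.g. kubectl get pod pravega-bookie-0 -o yaml, and seeing which anti-affinity field is populated. A minimal sketch of the two possibilities (field contents elided):

spec:
  affinity:
    podAntiAffinity:
      # hard rule: the scheduler must not co-locate matching pods
      requiredDuringSchedulingIgnoredDuringExecution: []
      # soft rule: co-location is only penalized in scoring and can still happen
      preferredDuringSchedulingIgnoredDuringExecution: []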

sumit-bm changed the title from "Anti-afinity seems not working with kubectl drain" to "Anti-affinity seems not working with kubectl drain" on Apr 1, 2019
sumit-bm added the kind/bug (Something isn't working) label on Apr 1, 2019
Tristan1900 (Member) commented

I think this behavior is expected because our pod anti-affinity implementation uses preferredDuringSchedulingIgnoredDuringExecution.
With this strategy, when scheduling a pod, Kubernetes computes a weighted score for each node from multiple built-in priority functions such as LeastRequestedPriority, BalancedResourceAllocation, and SelectorSpreadPriority; the inter-pod anti-affinity preference is just one of those scores. This issue probably happens when the other functions' scores dominate. Note that, before bookie-0 is scheduled, the *ba node has higher CPU and memory requests and limits than the *50 node.
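
For reference, a soft anti-affinity term of this kind looks roughly like the sketch below; the label selector is illustrative rather than the exact one the operator sets, and the weight is only one contribution to a node's overall score:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            component: bookie          # illustrative label, not verified against the operator
        topologyKey: kubernetes.io/hostname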

adrianmo added the status/invalid (This doesn't seem right) label on Apr 2, 2019
adrianmo (Contributor) commented Apr 2, 2019

@Tristan1900 is right. At the moment, we are not enforcing that pods of the same kind be placed on different nodes; it is just a preference that can be outweighed by other scheduling scores, as @Tristan1900 mentioned.

However, it's also true that there might be scenarios in which we want to enforce placement to favor security and partition tolerance over availability. I've created #155 to expose pod affinity and make it configurable.
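
For comparison, enforcing placement would mean switching to the hard variant of the rule, roughly as sketched below (again with an illustrative selector; the exact option exposed by #155 may differ). The trade-off is availability: with a hard rule, an evicted bookie stays Pending when no eligible node remains, rather than doubling up on a node that already runs one.

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          component: bookie            # illustrative label, not verified against the operator
      topologyKey: kubernetes.io/hostname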

Closing this issue.

adrianmo closed this as completed on Apr 2, 2019
Tristan1900 (Member) commented

Thanks @adrianmo for the confirmation!
