Anti-affinity seems not working with kubectl drain #152

Closed
sumit-bm opened this issue Apr 1, 2019 · 3 comments
Labels: kind/bug (Something isn't working), status/invalid (This doesn't seem right)

Comments

sumit-bm commented Apr 1, 2019

While trying fault injection and scheduled maintenance scenarios under moderate IO (transactions with a single reader and writer, plus a benchmark with 5 streams: 10 readers, 10 writers, 1000-byte events at 100 events/sec, 10 segments each), I found that Pod anti-affinity does not seem to work effectively to prevent more than one instance of a particular component from being scheduled onto the same node. For example, in the following PKS cluster I ran kubectl drain to simulate scheduled maintenance of one of the cluster's nodes.

Environment details: PKS / Kubernetes, medium cluster:

3 master nodes @ large.cpu (4 CPU, 4 GB RAM, 16 GB disk)
5 worker nodes @ xlarge.cpu (8 CPU, 8 GB RAM, 32 GB disk)
Tier-1 storage is from a vSAN datastore
Tier-2 storage is carved out via the NFS client provisioner with Isilon as the backend

Before kubectl drain, the node and Pod distribution was as follows:

[root@manager1 ~]# kubectl get nodes
NAME                                   STATUS    ROLES     AGE       VERSION
20050ce9-a7a5-4d9c-9e2f-edd3e32a74ba   Ready     <none>    1d        v1.12.4
6d582fdd-f02b-4e3a-a767-98ce0ac82650   Ready     <none>    1d        v1.12.4
831d9e0d-559e-47a7-bbd5-77eeff9d8c58   Ready     <none>    1d        v1.12.4
e4f2afcd-2201-44d5-a893-4509e35236b6   Ready     <none>    1d        v1.12.4
fb77cf66-5f5f-470b-b1ca-cac9247fe0bb   Ready     <none>    1d        v1.12.4
[root@manager1 ~]# kubectl describe node fb77cf66-5f5f-470b-b1ca-cac9247fe0bb
...
output truncated
...
Non-terminated Pods:         (7 in total)
  Namespace                  Name                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                          ------------  ----------  ---------------  -------------
  default                    pravega-bookie-0                              1 (12%)       2 (25%)     2Gi (26%)        4Gi (52%)
  default                    pravega-pravega-controller-c67d6b758-gp2ts    1 (12%)       2 (25%)     2Gi (26%)        3Gi (39%)
  kube-system                kubernetes-dashboard-5f4b59b97f-7p82j         50m (0%)      100m (1%)   100Mi (1%)       300Mi (3%)
  kube-system                monitoring-influxdb-cdcf4674-bz54z            0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pks-system                 event-controller-6c77ddd949-tkkx8             0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pks-system                 fluent-bit-c2s5x                              0 (0%)        0 (0%)      100Mi (1%)       100Mi (1%)
  pravega-longevity          small-tx-manager-687978df5d-np8ds             0 (0%)        0 (0%)      0 (0%)           0 (0%)
[root@manager1 ~]# kubectl describe node 6d582fdd-f02b-4e3a-a767-98ce0ac82650

Non-terminated Pods:         (8 in total)
  Namespace                  Name                                              CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                              ------------  ----------  ---------------  -------------
  default                    isilon-nfs-client-provisioner-67b7ffff86-4pdwf    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                    pravega-bookie-2                                  1 (12%)       2 (25%)     2Gi (26%)        4Gi (52%)
  default                    pravega-pravega-controller-c67d6b758-w7zsx        1 (12%)       2 (25%)     2Gi (26%)        3Gi (39%)
  kube-system                heapster-85647cf566-77q7h                         0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                tiller-deploy-95d654d46-9tg9r                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pks-system                 fluent-bit-vbkhz                                  0 (0%)        0 (0%)      100Mi (1%)       100Mi (1%)
  pks-system                 sink-controller-65595c498b-zmmxx                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pravega-longevity          small-tx-worker-76c77bb574-llmjd                  0 (0%)        0 (0%)      0 (0%)           0 (0%)

Initiated kubectl drain on one node:

[root@manager1 ~]# kubectl drain fb77cf66-5f5f-470b-b1ca-cac9247fe0bb --delete-local-data --ignore-daemonsets
node "fb77cf66-5f5f-470b-b1ca-cac9247fe0bb" already cordoned
WARNING: Deleting pods with local storage: kubernetes-dashboard-5f4b59b97f-7p82j, monitoring-influxdb-cdcf4674-bz54z, event-controller-6c77ddd949-tkkx8; Ignoring DaemonSet-managed pods: fluent-bit-c2s5x
pod "monitoring-influxdb-cdcf4674-bz54z" evicted
pod "kubernetes-dashboard-5f4b59b97f-7p82j" evicted
pod "event-controller-6c77ddd949-tkkx8" evicted
pod "pravega-pravega-controller-c67d6b758-gp2ts" evicted
pod "small-tx-manager-687978df5d-np8ds" evicted
pod "pravega-bookie-0" evicted
node "fb77cf66-5f5f-470b-b1ca-cac9247fe0bb" drained
[root@manager1 ~]# kubectl get nodes
NAME                                   STATUS                     ROLES     AGE       VERSION
20050ce9-a7a5-4d9c-9e2f-edd3e32a74ba   Ready                      <none>    1d        v1.12.4
6d582fdd-f02b-4e3a-a767-98ce0ac82650   Ready                      <none>    1d        v1.12.4
831d9e0d-559e-47a7-bbd5-77eeff9d8c58   Ready                      <none>    1d        v1.12.4
e4f2afcd-2201-44d5-a893-4509e35236b6   Ready                      <none>    1d        v1.12.4
fb77cf66-5f5f-470b-b1ca-cac9247fe0bb   Ready,SchedulingDisabled   <none>    1d        v1.12.4
[root@manager1 ~]#

Found that bookie-0, which was running on the drained node, is now running on another node where a bookie (bookie-2) is already running.

[root@manager1 ~]# kubectl describe node 6d582fdd-f02b-4e3a-a767-98ce0ac82650

Non-terminated Pods:         (12 in total)
  Namespace                  Name                                              CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                              ------------  ----------  ---------------  -------------
  default                    isilon-nfs-client-provisioner-67b7ffff86-4pdwf    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                    pravega-bookie-0                                  1 (12%)       2 (25%)     2Gi (26%)        4Gi (52%)
  default                    pravega-bookie-2                                  1 (12%)       2 (25%)     2Gi (26%)        4Gi (52%)
  default                    pravega-pravega-controller-c67d6b758-w7zsx        1 (12%)       2 (25%)     2Gi (26%)        3Gi (39%)
  kube-system                heapster-85647cf566-77q7h                         0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                monitoring-influxdb-cdcf4674-9sk8c                0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system                tiller-deploy-95d654d46-9tg9r                     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pks-system                 event-controller-6c77ddd949-54ddm                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pks-system                 fluent-bit-vbkhz                                  0 (0%)        0 (0%)      100Mi (1%)       100Mi (1%)
  pks-system                 sink-controller-65595c498b-zmmxx                  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pravega-longevity          small-tx-manager-687978df5d-tvgjx                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pravega-longevity          small-tx-worker-76c77bb574-llmjd                  0 (0%)        0 (0%)      0 (0%)           0 (0%)

Per the anti-affinity rule, pod scheduling eligibility seems broken here: after draining one node, N-1 distinct nodes are still available, and on at least one of them no bookie is running. Snip below:

[root@manager1 ~]# kubectl describe node 20050ce9-a7a5-4d9c-9e2f-edd3e32a74ba

Non-terminated Pods:         (6 in total)
  Namespace                  Name                                CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                ------------  ----------  ---------------  -------------
  default                    pravega-benchmark                   0 (0%)        0 (0%)      0 (0%)           0 (0%)
  default                    pravega-pravega-segmentstore-0      2 (25%)       4 (50%)     4Gi (52%)        6Gi (78%)
  default                    pravega-zk-2                        500m (6%)     1 (12%)     1Gi (13%)        2Gi (26%)
  kube-system                metrics-server-555d98886f-dj5cz     0 (0%)        0 (0%)      0 (0%)           0 (0%)
  pks-system                 fluent-bit-mnxpb                    0 (0%)        0 (0%)      100Mi (1%)       100Mi (1%)
  pks-system                 telemetry-agent-559f9c8855-26srv    0 (0%)        0 (0%)      0 (0%)           0 (0%)

The bookie-0 pod should have been started on the *ba node, where no bookie was running, rather than on the *50 node where bookie-2 was already running.
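
Whether this rule is a hard guarantee or only a preference can be checked by dumping the rescheduled pod's spec, e.g. kubectl get pod pravega-bookie-0 -o yaml, and seeing which anti-affinity field is populated. A minimal sketch of the two possibilities (field contents elided):

spec:
  affinity:
    podAntiAffinity:
      # hard rule: the scheduler must not co-locate matching pods
      requiredDuringSchedulingIgnoredDuringExecution: []
      # soft rule: co-location is only penalized in scoring and can still happen
      preferredDuringSchedulingIgnoredDuringExecution: []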

sumit-bm changed the title from "Anti-afinity seems not working with kubectl drain" to "Anti-affinity seems not working with kubectl drain" on Apr 1, 2019
sumit-bm added the kind/bug (Something isn't working) label on Apr 1, 2019
Tristan1900 (Member) commented

I think this behavior is expected because our pod anti-affinity implementation uses preferredDuringSchedulingIgnoredDuringExecution.
With this strategy, when scheduling a pod, Kubernetes computes a weighted score for each node from multiple built-in priority functions such as LeastRequestedPriority, BalancedResourceAllocation, and SelectorSpreadPriority; the inter-pod anti-affinity preference is just one of those scores. This issue probably happens when the other functions' scores dominate. Note that, before bookie-0 is scheduled, the *ba node has higher CPU and memory requests and limits than the *50 node.
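
For reference, a soft anti-affinity term of this kind looks roughly like the sketch below; the label selector is illustrative rather than the exact one the operator sets, and the weight is only one contribution to a node's overall score:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            component: bookie          # illustrative label, not verified against the operator
        topologyKey: kubernetes.io/hostname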

adrianmo added the status/invalid (This doesn't seem right) label on Apr 2, 2019
adrianmo (Contributor) commented Apr 2, 2019

@Tristan1900 is right. At the moment, we are not enforcing that pods of the same kind be placed on different nodes; it is just a preference that can be outweighed by other scheduling scores, as @Tristan1900 mentioned.

However, it's also true that there might be scenarios in which we want to enforce placement to favor security and partition tolerance over availability. I've created #155 to expose pod affinity and make it configurable.
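
For comparison, enforcing placement would mean switching to the hard variant of the rule, roughly as sketched below (again with an illustrative selector; the exact option exposed by #155 may differ). The trade-off is availability: with a hard rule, an evicted bookie stays Pending when no eligible node remains, rather than doubling up on a node that already runs one.

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          component: bookie            # illustrative label, not verified against the operator
      topologyKey: kubernetes.io/hostname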

Closing this issue.

adrianmo closed this as completed on Apr 2, 2019
Tristan1900 (Member) commented

Thanks @adrianmo for the confirmation!
