While trying fault injection and scheduled maintenance scenarios under moderate IO (transactions with a single reader and writer, plus a benchmark with 5 streams: 10 readers, 10 writers, 1000-byte events at 100 events/sec, with 10 segments each), pod anti-affinity does not seem to work effectively at preventing more than one instance of a particular component from being scheduled onto the same node. For example, in the following PKS cluster, kubectl drain was used to simulate scheduled maintenance of a node belonging to the cluster.
Environment details: PKS / K8s with a medium cluster:
3 master nodes @ large.cpu (4 CPU, 4 GB RAM, 16 GB disk)
5 worker nodes @ xlarge.cpu (8 CPU, 8 GB RAM, 32 GB disk)
Tier-1 storage is from a vSAN datastore
Tier-2 storage is carved out via the NFS Client Provisioner using Isilon as the backend
Before kubectl drain, the node and pod distribution was as follows:
[root@manager1 ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
20050ce9-a7a5-4d9c-9e2f-edd3e32a74ba Ready <none> 1d v1.12.4
6d582fdd-f02b-4e3a-a767-98ce0ac82650 Ready <none> 1d v1.12.4
831d9e0d-559e-47a7-bbd5-77eeff9d8c58 Ready <none> 1d v1.12.4
e4f2afcd-2201-44d5-a893-4509e35236b6 Ready <none> 1d v1.12.4
fb77cf66-5f5f-470b-b1ca-cac9247fe0bb Ready <none> 1d v1.12.4
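(For reference, such a drain is typically initiated along these lines; the node name is a placeholder and the flags shown are the ones commonly used on k8s 1.12:)

kubectl drain <node-name> --ignore-daemonsets --delete-local-data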
Initiated kubectl drain on one node and found that bookie-0, which had been running on the drained node, was rescheduled onto a node where another bookie (bookie-2) was already running. As per the anti-affinity rule, scheduling eligibility seems broken here: N-1 distinct nodes are still available after draining one node, and one of them has no bookie running on it. Snip below:
Bookie-0 should have been started on the *ba node, where no bookie was running, rather than on the *50 node where bookie-2 was already running.
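(The pod-to-node placement shown in such a snip can be reproduced with kubectl's wide output, which adds a NODE column; the namespace is a placeholder:)

kubectl get pods -o wide -n <namespace>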
I think this behavior is expected when using preferredDuringSchedulingIgnoredDuringExecution as our implementation of pod anti-affinity.
With this strategy, k8s calculates a weighted score for each node when scheduling pods, according to multiple built-in priority functions such as LeastRequestedPriority, BalancedResourceAllocation, and SelectorSpreadPriority. NodeAffinityPriority is just one of those functions. This issue probably happens when the other functions' scores dominate. We can see that the *ba node had higher CPU and memory requests and limits than the *50 node before bookie-0 was scheduled.
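For context, a preferred (soft) anti-affinity term of this kind looks roughly as follows; the weight and the app label are illustrative and should match whatever labels the bookie pods actually carry:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: bookkeeper-cluster   # assumed label, not confirmed by this issue
        topologyKey: kubernetes.io/hostname

With this form, the anti-affinity score is only one input among the priority functions above, so a node that already hosts a bookie can still win if it scores better on the resource-based priorities.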
@Tristan1900 is right. At the moment, we are not enforcing that pods of the same kind be placed on different nodes; it's just a preference that can be outweighed by other metrics, as @Tristan1900 mentioned.
However, it's also true that there might be scenarios in which we want to enforce placement, favoring security and partition tolerance over availability. I've created #155 to expose pod affinity and make it configurable.
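As a sketch of what such an enforced setting could look like (again assuming an illustrative app label), the hard variant uses requiredDuringSchedulingIgnoredDuringExecution instead:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: bookkeeper-cluster   # assumed label, not confirmed by this issue
      topologyKey: kubernetes.io/hostname

The trade-off is availability: with one bookie per node strictly enforced and no spare node, a drained bookie stays Pending until a node frees up, rather than being co-located with another bookie.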