cluster-autoscaler does not support custom scheduling config #4518
The specific use case (changing scheduler preferences regarding node utilization) should work fine with CA as is. Cluster Autoscaler only runs scheduler Filters in simulation; it completely ignores Scores. In other words, CA only simulates "hard" scheduling requirements (e.g. whether a node has enough resources, requiredDuringScheduling affinities) and completely ignores any preferences (e.g. more/less utilized nodes, preferredDuringScheduling affinities, podTopologySpread with ScheduleAnyway set).

The change you described above only changes scheduler preferences, which CA doesn't take into account anyway, so it shouldn't conflict with CA.

Supporting scheduler config is a fairly significant feature request. It's also unclear how useful it would be, given that:
Conceptually, this feature makes sense and we'd be happy to accept a contribution, but given the above I don't think it's a high priority for us.

Finally, a nit: this should not be kind/bug. CA explicitly supports only the default scheduler and doesn't have any feature that would allow customizing the scheduler config. Lack of a feature is not a bug.
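To illustrate the Filters vs. Scores distinction above: in a pod spec like the sketch below (the label keys and values are just examples), CA would simulate the required affinity but completely ignore the preferred one.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-example              # example name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9  # placeholder image
  affinity:
    nodeAffinity:
      # Hard requirement: a scheduler Filter, so CA simulates it.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["eu-west-1a"]
      # Soft preference: a scheduler Score, so CA ignores it.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: node.kubernetes.io/instance-type
            operator: In
            values: ["m5.xlarge"]
```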
Thanks for the reply @MaciekPytel. Happy to see that, in theory, custom scheduling config regarding node utilization should work fine with the cluster-autoscaler. Let me try it out.
@MaciekPytel, would it be possible to make CA not ignore the scheduling preferences you listed? I'm expecting a problem related to it when using topologySpreadConstraints with ScheduleAnyway.

I have a cluster on AWS with node groups in two AZs, and my deployments use topologySpreadConstraints with ScheduleAnyway to keep a balanced number of replicas between AZs. It needs to be configured with ScheduleAnyway so that, if an AZ goes down, the pods are rescheduled to the other AZ, keeping the total number of replicas of the service.

The problem is that CA scales down nodes without considering the pods with ScheduleAnyway, so one AZ ends up with more nodes and service replicas than the other. Sometimes all replicas of a deployment end up in a single AZ due to this behavior. In the case of an AZ outage, these services would have downtime.
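For reference, the constraint looks something like this sketch (the deployment name, labels, image and replica count are placeholders, not my exact setup):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service            # placeholder name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9   # placeholder image
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        # Soft constraint: the pod can still be scheduled when spreading
        # can't be satisfied (e.g. during an AZ outage), but CA's scale-down
        # simulation ignores it entirely.
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: my-service
```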
I've oversimplified the description of my setup in the last message, but I think it's enough to understand the problem. Anyway, I'll give you some extra detail here.

To ensure I always have available space on nodes in both AZs for the scheduler to assign my pods to, I have an overprovisioning cronjob that runs periodically and creates a pod that completes instantly but requests all of the allocatable space of the node type in that node group. This effectively makes CA scale up the node group every time I don't have an empty node in it. With that, pods with topologySpreadConstraints set to ScheduleAnyway can be assigned to the preferred AZ, up to the space available on that spare node.

It's possible that the overprovisioning didn't run in time or the scale-up took too long, and replicas are now unbalanced between AZs. To fix that, I use kubernetes-sigs/descheduler to evict pods violating the topologySpreadConstraints, even the ones with ScheduleAnyway. Together with the overprovisioning cronjob, this eventually rebalances all replicas between AZs.

Then comes the CA problem. It looks at the cluster state without considering scheduling preferences, concludes there are too many nodes in the cluster and that it can reallocate pods to reduce the total number of nodes, and performs the scale-down. Now the replicas are unbalanced again, and overprovisioning and descheduler start to fight against CA, causing an infinite loop of scale-ups and scale-downs on the cluster. A high number of evictions happen, degrading the performance of services caught in this reallocation battle.

If somehow we could tell CA to consider the soft scheduling constraints for scale-downs, it would solve this problem.
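The overprovisioning cronjob is roughly the following sketch (the schedule, node selector label and request sizes are placeholders, not my real values):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: overprovision-probe          # placeholder name
spec:
  schedule: "*/10 * * * *"           # placeholder cadence
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          nodeSelector:
            node-group: zonal-pool   # placeholder node-group label
          containers:
          - name: reserve
            image: busybox:1.36
            command: ["true"]        # completes instantly
            resources:
              requests:
                # Roughly the allocatable capacity of one node in the group,
                # so the pod only fits on an otherwise empty node and forces
                # CA to scale up when no empty node exists.
                cpu: "3800m"
                memory: "14Gi"
```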
It's not a simple switch we could flip, unfortunately. We originally decided to only run Filters() because pod preferences don't fit at all into the first-fit binpacking algorithm CA uses for scale-up. Also, CA runs a lot of scheduler simulations and it's hard enough to get CA to work in large clusters running just the Filters(); adding Scores() would significantly increase the amount of computation required. I still don't know how to fix either of those issues, and even if I did, we'd have to rewrite a lot of CA to support Scores(). So I think it's very unlikely we'll ever do this.

I think the best suggestion I have for your use case is to implement your own version of https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodes/types.go#L39. The interface is meant as an extension point for customizing CA behavior, and it allows you to choose the order in which nodes will be scaled down.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Not applicable
What k8s version are you using (kubectl version)?:
v1.22.2
What environment is this in?:
Gardener
What did you expect to happen?:
cluster-autoscaler to support a configurable scheduling config (or cluster-autoscaler to rework the existing mechanism with the simulator and the hard-coded default scheduling config).
What happened instead?:
The default scheduling behavior of kube-scheduler is to spread Pods across Nodes. To improve the utilization of our Nodes, we would like to run kube-scheduler with a custom configuration that improves Node utilization by preferring the most allocated Node.
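For example, something along the lines of the following scheduler config sketch (assuming the v1beta2 config API that ships with v1.22; the plugin weights are only illustrative):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta2   # newer clusters would use .../v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        # Score nodes by how allocated they already are, so pods are
        # bin-packed onto fewer nodes instead of spread out.
        type: MostAllocated
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
```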
However, with #4517 (and also from reading the code and existing issues) we see that cluster-autoscaler internally vendors the kube-scheduler packages and runs a simulation to determine whether a Pod can be scheduled. We see that the simulator uses the default scheduling config, and currently there is no way to run the autoscaler with a custom scheduling config. As you may already guess, discrepancies and issues arise when kube-scheduler and cluster-autoscaler run with different scheduling configs - we may easily end up in a situation where a Pod is unschedulable according to kube-scheduler but schedulable according to cluster-autoscaler -> cluster-autoscaler refuses to scale up the Node count.
How to reproduce it (as minimally and precisely as possible):
Run kube-scheduler with a custom scheduling config
Observe that the issues described above occur because the cluster-autoscaler is using the default scheduling config