
cluster-autoscaler does not support custom scheduling config #4518

Closed
ialidzhikov opened this issue Dec 13, 2021 · 9 comments
Labels
area/cluster-autoscaler kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@ialidzhikov
Contributor

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Not applicable

What k8s version are you using (kubectl version)?:

v1.22.2

What environment is this in?:

Gardener

What did you expect to happen?:

cluster-autoscaler to support a configurable custom scheduling config (or to rework the existing mechanism in which the simulator uses a hard-coded default scheduling config).

What happened instead?:

The default scheduling algorithm in kube-scheduler is to spread Pods across Nodes. To improve the utilization of our Nodes, we would like to run the kube-scheduler with a custom configuration that improves Node utilization by selecting the most allocated Node.
However, with #4517 (and also from reading the code and existing issues) we see that cluster-autoscaler internally vendors the kube-scheduler packages and runs a simulation to determine whether a Pod can be scheduled. We see that the simulator uses the default scheduling config and there is currently no way to run the autoscaler with a custom scheduling config. As you may already guess, discrepancies and issues arise when kube-scheduler and cluster-autoscaler run with different scheduling configs - we may easily end up in a situation where a Pod is unschedulable according to the kube-scheduler but schedulable according to the cluster-autoscaler -> cluster-autoscaler refuses to scale up the Node count.
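For reference, a minimal sketch of the kind of custom scheduling config meant here, assuming a recent scheduler config API version (the exact apiVersion is v1beta2/v1beta3/v1 depending on the Kubernetes version); it scores nodes by how allocated they already are instead of spreading:

```yaml
# Sketch only: kube-scheduler configuration favouring the most allocated node.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated   # prefer nodes that are already highly utilized
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```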

How to reproduce it (as minimally and precisely as possible):

  1. Run kube-scheduler with a custom scheduling config

  2. Observe that the issues described above occur because the cluster-autoscaler is using the default scheduling config

@ialidzhikov ialidzhikov added the kind/bug Categorizes issue or PR as related to a bug. label Dec 13, 2021
@MaciekPytel
Contributor

The specific use-case (changing scheduler preferences regarding node utilization) should work fine with CA as is. Cluster Autoscaler only runs scheduler Filters in simulation; it completely ignores Scores. In other words, CA only simulates "hard" scheduling requirements (e.g. whether a node has resources, requiredDuringScheduling affinities) and completely ignores any preferences (e.g. more/less utilized nodes, preferredDuringScheduling affinities, pod topology spreading with ScheduleAnyway set).

The change you described above only changes scheduler preferences, which CA doesn't take into account anyway, so it shouldn't conflict with CA.
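To make the Filters-vs-Scores distinction concrete, here is an illustrative pod affinity snippet (not from the issue; keys and values are placeholders). CA's simulation honors the required rule but ignores the preferred one:

```yaml
# Illustrative only: "required" affinity is a Filter (simulated by CA),
# "preferred" affinity is a Score (ignored by CA).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:    # hard requirement
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["eu-west-1a"]
    preferredDuringSchedulingIgnoredDuringExecution:   # soft preference
      - weight: 100
        preference:
          matchExpressions:
            - key: node.kubernetes.io/instance-type    # placeholder key
              operator: In
              values: ["m5.xlarge"]                    # placeholder value
```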

Supporting scheduler config is a fairly significant feature request. It's also unclear how useful it would be, given that:

  1. Only Filters config is relevant to CA.
  2. Adding any custom Filter that is not part of k8s codebase would require recompiling CA anyway.
  3. I'm not aware of any common use-case for tweaking config of default Filters.

Conceptually, this feature makes sense and we'd be happy to accept a contribution, but given the above I don't think it's a high priority for us.

Finally, nit: this should not be kind/bug. CA explicitly only supports default scheduler and doesn't have any feature that would allow customizing scheduler config. Lack of feature is not a bug.

@MaciekPytel MaciekPytel added area/cluster-autoscaler kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Dec 13, 2021
@ialidzhikov
Contributor Author

Thanks for the reply @MaciekPytel . Happy to see that, in theory, custom scheduling config regarding node utilization should work fine with the cluster-autoscaler. Let me try it out.

@t0rr3sp3dr0

@MaciekPytel, would it be possible to make CA not ignore the scheduling preferences you listed? I'm expecting a problem related to it when using topologySpreadConstraints with ScheduleAnyway.

I have a cluster on AWS with node groups in two AZs, and my deployments use topologySpreadConstraints with ScheduleAnyway to keep a balanced number of replicas between AZs. It needs to be configured with ScheduleAnyway so that if an AZ goes down, the pods are rescheduled to the other AZ, keeping the total number of replicas of the service.

The problem is that CA is scaling down nodes without considering the pods with ScheduleAnyway, so one AZ ends up with more nodes and replicas of services than the other one. Sometimes all replicas of a deployment go to a single AZ due to this behavior. In case of an AZ outage, it would cause these services to have downtime.
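For context, a minimal sketch of the kind of constraint described (the app label is a placeholder); because whenUnsatisfiable is ScheduleAnyway, it is a scoring preference that CA's simulation ignores:

```yaml
# Sketch of a soft zone-spreading constraint (ignored by CA's Filter-only simulation).
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: my-service   # placeholder label
```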

@t0rr3sp3dr0

I've oversimplified the description of my setup in my last message, but I think it's enough to understand the problem. Anyway, I'll give you some extra detail here.

To ensure I always have available space on nodes in both AZs for the scheduler to assign my pods to, I have an overprovisioning cronjob that runs periodically and creates a pod that completes instantly but requests all allocatable space for the node type in that node group. This effectively makes CA scale up the node group whenever I don't have an empty node in it. With that, pods with topologySpreadConstraints set to ScheduleAnyway can be assigned to the preferred AZ, up to the space available on that spare node.
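Roughly, such an overprovisioning job could look like the sketch below (names, schedule, node selector, and resource values are all made up and would need to be sized to the actual node type):

```yaml
# Hypothetical sketch: a CronJob whose pod requests roughly a full node's
# allocatable resources, forcing CA to keep one empty node of headroom.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: overprovisioner              # made-up name
spec:
  schedule: "*/10 * * * *"           # made-up interval
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          nodeSelector:
            node.kubernetes.io/instance-type: m5.xlarge   # placeholder node type
          containers:
            - name: probe
              image: busybox
              command: ["sh", "-c", "exit 0"]   # completes instantly
              resources:
                requests:
                  cpu: "3500m"       # placeholder: near the node's allocatable CPU
                  memory: "14Gi"     # placeholder: near the node's allocatable memory
```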

It's possible that the overprovisioning didn't run in time or the scale-up took too long and replicas are now unbalanced between AZs. To fix that, I use kubernetes-sigs/descheduler to evict pods violating the topologySpreadConstraints, even the ones with ScheduleAnyway. Together with the overprovisioning cronjob, this eventually rebalances all replicas between AZs.

Then comes the CA problem. It looks at the cluster state without considering scheduling preferences and concludes that there are too many nodes in the cluster and that it can reallocate pods to reduce the total number of nodes. It performs the scale-down operation, and now the replicas are unbalanced.

Now overprovisioning and descheduler start to fight against CA, causing an infinite loop of scale-ups and scale-downs on the cluster. A high number of evictions starts to happen, degrading the performance of the services caught in this reallocation battle.

If somehow we could tell CA to consider the soft scheduling constraints for scale downs, it would solve this problem.

@MaciekPytel
Contributor

It's not a simple switch we could flip, unfortunately. We originally decided to only run Filters() because pod preferences simply don't fit into the first-fit bin-packing algorithm CA uses for scale-up. Also, CA runs a lot of scheduler simulations, and it's hard enough to get CA to work in large clusters running just the Filters(); adding Scores() would significantly increase the amount of computation required.

I still don't know how to fix either of those issues, and even if I did, we'd have to rewrite a lot of CA to support Scores(). So I think it's very unlikely we'll ever do this. The best suggestion I have for your use-case is to implement your own version of https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodes/types.go#L39. That interface is meant as an extension point for customizing CA behavior, and it allows you to choose the order in which nodes will be scaled down.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 14, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 14, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
