
Update KubeRay Autoscaler to use NumOfHosts for min/max workers #48212

Merged: 12 commits merged into ray-project:master on Feb 5, 2025

Conversation

ryanaoleary (Contributor)

Why are these changes needed?

This PR updates min_workers and max_workers in the autoscaler's available_node_types to account for the value of numOfHosts, defaulting to 1 when this value is not set. The current behavior doesn't block multi-host autoscaling, since you can set minReplicas and maxReplicas to the desired total number of multi-host workers, but this change helps avoid unexpected behavior for users.
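
For illustration, here is a minimal sketch of the resulting computation. This is not the actual Ray/KubeRay source; the helper name and dict layout below are assumptions made only for this example.

```python
# Minimal sketch: derive the autoscaler's per-group worker bounds from a
# KubeRay worker group spec, scaling by numOfHosts. Illustrative only; the
# real logic lives in Ray's KubeRay autoscaling config generation.

def worker_group_to_bounds(worker_group: dict) -> dict:
    # numOfHosts defaults to 1 when not set (single-host worker groups).
    num_of_hosts = worker_group.get("numOfHosts", 1)
    min_replicas = worker_group.get("minReplicas", 0)
    max_replicas = worker_group.get("maxReplicas", 0)

    # Each replica of a multi-host group creates numOfHosts Ray worker Pods,
    # so the per-group min/max worker counts are scaled accordingly.
    return {
        "min_workers": min_replicas * num_of_hosts,
        "max_workers": max_replicas * num_of_hosts,
    }


# Example: a TPU slice group with 2 hosts per replica and maxReplicas=4
# produces max_workers=8 in the generated autoscaling config.
print(worker_group_to_bounds({"numOfHosts": 2, "minReplicas": 0, "maxReplicas": 4}))
# {'min_workers': 0, 'max_workers': 8}
```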

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

ryanaoleary requested review from hongchaodeng and a team as code owners on October 23, 2024
andrewsykim (Contributor) left a comment:

LGTM

ryanaoleary (Contributor, Author):

@hongchaodeng is a code owner able to review/merge this for me?

jcotant1 added the kuberay label (Issues for the Ray/Kuberay integration that are tracked on the Ray side) on Nov 18, 2024
ryanaoleary (Contributor, Author) commented on Jan 16, 2025:

Related Issue:

#2600

ryanaoleary (Contributor, Author):

Closes #2820

aslonnie (Collaborator) left a comment:

please properly rebase.

ryanaoleary (Contributor, Author):

> please properly rebase.

Oh, sorry about that. I fixed the improper rebase; I'm not sure why those other commits got added to the PR.

aslonnie removed the review request for a team, sven1977, and simonsays1980 on January 30, 2025
ryanaoleary requested a review from aslonnie on January 31, 2025
kevin85421 self-assigned this on Feb 4, 2025
kevin85421 (Member) left a comment:

What happens if maxReplicas is less than replicas * numOfHosts? I guess the Ray Autoscaler will terminate the additional Pods?

@@ -109,7 +109,7 @@ def _get_basic_autoscaling_config() -> dict:
         # Same as "small-group" with a TPU resource entry added
         # and modified max_workers and node_config.
         "tpu-group": {
-            "max_workers": 4,
+            "max_workers": 8,
Review comment from a Member on this diff:

why do we need this change?

ryanaoleary (Contributor, Author) replied:

The max_workers in the autoscaling config will now equal numOfHosts * maxReplicas, so for the TPU group it's 8 rather than 4.
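
As a quick worked example, assuming the test's tpu-group sets maxReplicas: 4 and numOfHosts: 2 (values inferred only from the old and new expected numbers, not read from the actual test fixture):

```python
# Hypothetical values for the test's "tpu-group", inferred from the change in
# the expected max_workers (4 -> 8); not taken from the real test file.
max_replicas = 4
num_of_hosts = 2

max_workers = max_replicas * num_of_hosts
print(max_workers)  # 8, the updated expected value in the autoscaling config
```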

kevin85421 added the go label (add ONLY when ready to merge, run all tests) on Feb 5, 2025
ryanaoleary (Contributor, Author):

> What happens if maxReplicas is less than replicas * numOfHosts? I guess the Ray Autoscaler will terminate the additional Pods?

I think that's the current behavior of the autoscaler, i.e. you have to set maxReplicas to a higher value than replicas * numOfHosts or it will loop, deleting workers and then scaling them back up. The autoscaler treats the number of workers as the number of replicas when checking the maxReplicas limit, which isn't accurate for multi-host groups. Without this change, the autoscaler will delete multi-host replicas even when replicas < maxReplicas, since it sees more workers than the maxReplicas value.
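
To make that failure mode concrete, here is a rough illustration with hypothetical numbers (not the actual autoscaler code):

```python
# Hypothetical multi-host worker group; values chosen only to illustrate the
# pre-PR scale-down loop described above.
num_of_hosts = 4   # Pods (hosts) per multi-host replica
replicas = 2       # current replicas of the worker group
max_replicas = 3   # user-configured maxReplicas

running_workers = replicas * num_of_hosts  # 8 worker Pods

# Before this PR: the group's max_workers was taken as maxReplicas (3), so the
# 8 running workers appear to exceed the limit and get scaled down, even though
# replicas (2) is still below maxReplicas (3).
old_max_workers = max_replicas
print(running_workers > old_max_workers)   # True -> spurious scale-down

# After this PR: max_workers = maxReplicas * numOfHosts (12), so the 8 running
# workers are within bounds and the group stays stable.
new_max_workers = max_replicas * num_of_hosts
print(running_workers > new_max_workers)   # False -> no scale-down
```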

aslonnie (Collaborator) commented on Feb 5, 2025:

@kevin85421 any more comments or concerns? is this PR ready to merge?

kevin85421 (Member):

@aslonnie I chatted with @ryanaoleary offline. This PR is ready to merge.

jjyao merged commit e3680f7 into ray-project:master on Feb 5, 2025 (5 checks passed)