
Update KubeRay Autoscaler to use NumOfHosts for min/max workers #48212

Merged: 12 commits merged into ray-project:master on Feb 5, 2025

Conversation

ryanaoleary (Contributor)

Why are these changes needed?

This PR updates min_workers and max_workers in the autoscaler's available_node_types to account for the value of numOfHosts, defaulting to 1 when this value is not set. The current behavior doesn't block multi-host autoscaling, since you can set minReplicas and maxReplicas to the desired total number of multi-host workers, but this change helps avoid unexpected behavior for users.
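
For illustration, here is a minimal sketch of the resulting computation. This is not the actual Ray/KubeRay source; the helper name and dict layout below are assumptions made only for this example.

```python
# Minimal sketch: derive the autoscaler's per-group worker bounds from a
# KubeRay worker group spec, scaling by numOfHosts. Illustrative only; the
# real logic lives in Ray's KubeRay autoscaling config generation.

def worker_group_to_bounds(worker_group: dict) -> dict:
    # numOfHosts defaults to 1 when not set (single-host worker groups).
    num_of_hosts = worker_group.get("numOfHosts", 1)
    min_replicas = worker_group.get("minReplicas", 0)
    max_replicas = worker_group.get("maxReplicas", 0)

    # Each replica of a multi-host group creates numOfHosts Ray worker Pods,
    # so the per-group min/max worker counts are scaled accordingly.
    return {
        "min_workers": min_replicas * num_of_hosts,
        "max_workers": max_replicas * num_of_hosts,
    }


# Example: a TPU slice group with 2 hosts per replica and maxReplicas=4
# produces max_workers=8 in the generated autoscaling config.
print(worker_group_to_bounds({"numOfHosts": 2, "minReplicas": 0, "maxReplicas": 4}))
# {'min_workers': 0, 'max_workers': 8}
```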

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

ryanaoleary requested review from hongchaodeng and a team as code owners on October 23, 2024
andrewsykim (Contributor) left a comment:

LGTM

ryanaoleary (Contributor, Author):

@hongchaodeng is a code owner able to review/merge this for me?

jcotant1 added the kuberay label (Issues for the Ray/Kuberay integration that are tracked on the Ray side) on Nov 18, 2024
ryanaoleary (Contributor, Author) commented on Jan 16, 2025:

Related Issue:

#2600

ryanaoleary (Contributor, Author):

Closes #2820

aslonnie (Collaborator) left a comment:

please properly rebase.

ryanaoleary (Contributor, Author):

> please properly rebase.

Oh, sorry about that. I fixed the improper rebase; I'm not sure why those other commits got added to the PR.

aslonnie removed the review request for a team, sven1977, and simonsays1980 on January 30, 2025
ryanaoleary requested a review from aslonnie on January 31, 2025
kevin85421 self-assigned this on Feb 4, 2025
kevin85421 (Member) left a comment:

What happens if maxReplicas is less than replicas * numOfHosts? I guess the Ray Autoscaler will terminate the additional Pods?

@@ -109,7 +109,7 @@ def _get_basic_autoscaling_config() -> dict:
         # Same as "small-group" with a TPU resource entry added
         # and modified max_workers and node_config.
         "tpu-group": {
-            "max_workers": 4,
+            "max_workers": 8,
Review comment from a Member on this diff:

why do we need this change?

ryanaoleary (Contributor, Author) replied:

The max_workers in the autoscaling config will now equal numOfHosts * maxReplicas, so for the TPU group it's 8 rather than 4.
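
As a quick worked example, assuming the test's tpu-group sets maxReplicas: 4 and numOfHosts: 2 (values inferred only from the old and new expected numbers, not read from the actual test fixture):

```python
# Hypothetical values for the test's "tpu-group", inferred from the change in
# the expected max_workers (4 -> 8); not taken from the real test file.
max_replicas = 4
num_of_hosts = 2

max_workers = max_replicas * num_of_hosts
print(max_workers)  # 8, the updated expected value in the autoscaling config
```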

kevin85421 added the go label (add ONLY when ready to merge, run all tests) on Feb 5, 2025
ryanaoleary (Contributor, Author):

> What happens if maxReplicas is less than replicas * numOfHosts? I guess the Ray Autoscaler will terminate the additional Pods?

I think that's the current behavior of the autoscaler, i.e. you have to set maxReplicas to a higher value than replicas * numOfHosts or it will loop, deleting workers and then scaling them back up. The autoscaler treats the number of workers as the number of replicas when checking the maxReplicas limit, which isn't accurate for multi-host groups. Without this change, the autoscaler will delete multi-host replicas even when replicas < maxReplicas, since it sees more workers than the maxReplicas value.
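
To make that failure mode concrete, here is a rough illustration with hypothetical numbers (not the actual autoscaler code):

```python
# Hypothetical multi-host worker group; values chosen only to illustrate the
# pre-PR scale-down loop described above.
num_of_hosts = 4   # Pods (hosts) per multi-host replica
replicas = 2       # current replicas of the worker group
max_replicas = 3   # user-configured maxReplicas

running_workers = replicas * num_of_hosts  # 8 worker Pods

# Before this PR: the group's max_workers was taken as maxReplicas (3), so the
# 8 running workers appear to exceed the limit and get scaled down, even though
# replicas (2) is still below maxReplicas (3).
old_max_workers = max_replicas
print(running_workers > old_max_workers)   # True -> spurious scale-down

# After this PR: max_workers = maxReplicas * numOfHosts (12), so the 8 running
# workers are within bounds and the group stays stable.
new_max_workers = max_replicas * num_of_hosts
print(running_workers > new_max_workers)   # False -> no scale-down
```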

aslonnie (Collaborator) commented on Feb 5, 2025:

@kevin85421 any more comments or concerns? is this PR ready to merge?

kevin85421 (Member):

@aslonnie I chatted with @ryanaoleary offline. This PR is ready to merge.

jjyao merged commit e3680f7 into ray-project:master on Feb 5, 2025 (5 checks passed)