
Add Ray Autoscaler to the Flyte-Ray plugin #1937

Merged: 10 commits into flyteorg:master, Mar 15, 2024

Conversation

@Yicheng-Lu-llll (Member) commented Nov 6, 2023

TL;DR

NOTE: The Ray CI test failed because flyteorg/flyte#4363 needs to be merged first to update flytecli.

Currently, the Flyte-Ray plugin uses RayJob. However, there are cases where a RayJob may require an autoscaler:

  1. After completing a workload with a RayJob, a user might want to retain the logs, past tasks, and actor execution history for a period. Ray currently lacks a mechanism to persist this data, so the Ray cluster must keep running even after the workload completes. With an autoscaler, the Ray cluster keeps only the head pod while scaling all worker pods down to zero.
  2. Users do not need to estimate resource needs up front; the autoscaler takes care of scaling.

So, this PR adds Ray Autoscaler config to the Flyte-Ray plugin. Also see flyteorg/flyte#4363.

This PR also adds shutdown_after_job_finishes and ttl_seconds_after_finished (see the sketch after this list):

  • shutdown_after_job_finishes specifies whether the RayCluster should be deleted after the RayJob finishes.
  • ttl_seconds_after_finished specifies how many seconds to wait after the RayJob finishes before deleting the RayCluster (a no-op unless shutdown_after_job_finishes is True).
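
The sketch below is not from the PR itself; it shows the minimal combination of the two options and assumes they map onto the KubeRay RayJob spec fields shutdownAfterJobFinishes and ttlSecondsAfterFinished:

from flytekitplugins.ray import RayJobConfig, WorkerNodeConfig

# Delete the RayCluster 120 seconds after the RayJob finishes. If
# shutdown_after_job_finishes were left at its default of False, the
# TTL below would be ignored and the cluster would stay up.
cleanup_config = RayJobConfig(
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=1)],
    shutdown_after_job_finishes=True,
    ttl_seconds_after_finished=120,
)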

Below is a complete example:

import ray
from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig
from flytekit import Resources, task, workflow

@ray.remote
def f(x):
    return x * x

ray_config = RayJobConfig(
    head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
    # The behavior will be:
    # 1. Create a head node and 0 worker nodes.
    # 2. The worker group is scaled up to 2 replicas while the task runs.
    # 3. The worker group is scaled back to 0, and the cluster is deleted after the TTL (120s).
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=0, min_replicas=0, max_replicas=2)],
    enable_autoscaling=True,
    shutdown_after_job_finishes=True,
    # Seconds to wait after the RayJob finishes before deleting the RayCluster.
    ttl_seconds_after_finished=120,
)

@task(
    task_config=ray_config,
    requests=Resources(mem="1Gi", cpu="2"),
)
def ray_task() -> int:
    # Import the placement group API.
    from ray.util.placement_group import placement_group

    # Use a placement group to trigger autoscaling: each node has two CPUs,
    # so three 2-CPU bundles need two worker pods in addition to the head.
    # Once the task is done, the workers are scaled back down and removed.
    pg = placement_group([{"CPU": 2} for _ in range(3)])
    ray.get(pg.ready(), timeout=100)

    return 1

@workflow
def ray_workflow() -> int:
    return ray_task()
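
As a quick sanity check (a sketch, not part of the PR, assuming RayJobConfig is a plain dataclass so the new fields read back as attributes):

assert ray_config.enable_autoscaling is True
assert ray_config.shutdown_after_job_finishes is True
assert ray_config.ttl_seconds_after_finished == 120

On a live cluster, the scaling behavior can then be observed with kubectl: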

Initial state (replicas=0), only the head pod:

(flytekit) ubuntu@ip-172-31-2-249:~/flyte/flytekit$ kubectl get pod -n flytesnacks-development
NAME                                                    READY   STATUS    RESTARTS   AGE
f50cd4d1a5ceb40bc9aa-n0-0-raycluster-pbvx9-head-szdgn   2/2     Running   0          3s

Task executing (max_replicas=2), two workers scaled up:

(flytekit) ubuntu@ip-172-31-2-249:~/flyte/flytekit$ kubectl get pod -n flytesnacks-development
NAME                                                      READY   STATUS    RESTARTS   AGE
f50cd4d1a5ceb40bc9aa-n0-0-raycluster-pbvx9-head-szdgn     2/2     Running   0          54s
ceb40bc9aa-n0-0-raycluster-pbvx9-worker-ray-group-llzp7   1/1     Running   0          14s
ceb40bc9aa-n0-0-raycluster-pbvx9-worker-ray-group-mqrb6   1/1     Running   0          14s

Task finished (min_replicas=0), workers terminating:

(flytekit) ubuntu@ip-172-31-2-249:~/flyte/flytekit$ kubectl get pod -n flytesnacks-development
NAME                                                      READY   STATUS        RESTARTS   AGE
f50cd4d1a5ceb40bc9aa-n0-0-raycluster-pbvx9-head-szdgn     2/2     Running       0          2m16s
ceb40bc9aa-n0-0-raycluster-pbvx9-worker-ray-group-llzp7   1/1     Terminating   0          96s
ceb40bc9aa-n0-0-raycluster-pbvx9-worker-ray-group-mqrb6   1/1     Terminating   0          96s

After the TTL (120s), the whole RayCluster is gone:

(flytekit) ubuntu@ip-172-31-2-249:~/flyte/flytekit$ kubectl get pod -n flytesnacks-development
No resources found in flytesnacks-development namespace.
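
If you prefer scripting that check, here is a minimal sketch using the official kubernetes Python client (an illustration only; it assumes a local kubeconfig pointing at the cluster):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
# Print every pod in the namespace used above, with its phase.
for pod in v1.list_namespaced_pod("flytesnacks-development").items:
    print(pod.metadata.name, pod.status.phase)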

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

See the TL;DR above.

Tracking Issue

flyteorg/flyte#4187

Follow-up issue

NA


codecov bot commented Nov 6, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (7794d3a) 85.53% vs. head (150ab4d) 85.53%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1937   +/-   ##
=======================================
  Coverage   85.53%   85.53%           
=======================================
  Files         309      309           
  Lines       23460    23475   +15     
  Branches     3630     3630           
=======================================
+ Hits        20066    20079   +13     
- Misses       2752     2753    +1     
- Partials      642      643    +1     


@Yicheng-Lu-llll force-pushed the add-ray-autoscaler-config branch from 5de8cf7 to 3e54e42 on November 6, 2023 18:21
@pingsutw (Member) left a comment

LGTM, could we add a small test here?

@@ -148,13 +152,22 @@ def worker_group_spec(self) -> typing.List[WorkerGroupSpec]:
        """
        return self._worker_group_spec

    @property
    def enable_in_tree_autoscaling(self) -> bool:
Member

qq: why in_tree? Does Ray have other autoscaling strategies?

Member Author

Ray only has one autoscaling strategy. I have changed the name to enable_autoscaling. Thank you!

Yicheng-Lu-llll and others added 4 commits December 22, 2023 01:20
@@ -178,9 +192,13 @@ def __init__(
        self,
        ray_cluster: RayCluster,
        runtime_env: typing.Optional[str],
        ttl_seconds_after_finished: typing.Optional[int] = None,
        shutdown_after_job_finishes: bool = False,
Collaborator

Does this mean that, by default, the RayJob will not be reclaimed by KubeRay once the job finishes?


@@ -178,9 +192,13 @@ def __init__(
        self,
        ray_cluster: RayCluster,
        runtime_env: typing.Optional[str],
        ttl_seconds_after_finished: typing.Optional[int] = None,
Collaborator

this is a no-op if shutdown_after_job_finishes is set to False, right?

Member Author

Yes.
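
To make that concrete, here is a sketch (not code from the PR): with shutdown_after_job_finishes left at its default of False, any TTL value is ignored and the RayCluster is not reclaimed after the job finishes.

from flytekitplugins.ray import RayJobConfig, WorkerNodeConfig

noop_config = RayJobConfig(
    worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=1)],
    # shutdown_after_job_finishes defaults to False, so the TTL below is a no-op.
    ttl_seconds_after_finished=120,
)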

Comment on lines +15 to +17
enable_autoscaling=True,
shutdown_after_job_finishes=True,
ttl_seconds_after_finished=20,
Collaborator

This needs a new release of flyteidl.

@dosubot added the size:M label (this PR changes 30-99 lines, ignoring generated files) on Feb 16, 2024
@Yicheng-Lu-llll force-pushed the add-ray-autoscaler-config branch from c03ad40 to 150ab4d on February 18, 2024 05:11
@dosubot added the lgtm label (this PR has been approved by a maintainer) on Mar 15, 2024
@pingsutw merged commit 4208da2 into flyteorg:master Mar 15, 2024
80 of 81 checks passed
austin362667 pushed a commit to austin362667/flytekit that referenced this pull request Mar 16, 2024
fiedlerNr9 pushed a commit that referenced this pull request Jul 25, 2024