[WIP] Slurm agent #3005
Draft
JiangJiaWei1103 wants to merge 28 commits into flyteorg:master from JiangJiaWei1103:slurm-agent-dev
+613
−12
Changes from 25 commits
Commits
421d1b8 Add slurm plugin blank components (JiangJiaWei1103)
1d1f806 feat: Add naive slurm agent create and get with rest api (JiangJiaWei1103)
5d97126 Use asyncssh instead of REST API (JiangJiaWei1103)
2e7f0f2 Test ssh communication and run sbatch (JiangJiaWei1103)
9644b99 Add delete method and support slurm job state (JiangJiaWei1103)
e41b181 feat: Submit and run SlurmTask on a remote Slurm cluster (JiangJiaWei1103)
6db24dc refactor: Remove redundant task_module transfer (JiangJiaWei1103)
122c7f1 refactor: Remove redundant env var (JiangJiaWei1103)
e9760a7 docs: Add env setup guide for local test (JiangJiaWei1103)
e68fda9 docs: Add links and figures (JiangJiaWei1103)
470637c docs: Fix commit sha (JiangJiaWei1103)
1579ab4 docs: Fix commit sha for demo guide (JiangJiaWei1103)
0e538f0 docs: Fix links (JiangJiaWei1103)
8229418 feat: Support SSH config in task config (JiangJiaWei1103)
9e6d8a6 docs: Include ssh config in demo example (JiangJiaWei1103)
e07b09a refactor: Reduce ssh_conf option to slurm_host only (JiangJiaWei1103)
3a7eb6d feat: Support Slurm agent with ShellTask (JiangJiaWei1103)
a815fd9 feat: Simplify Slurm job submission logic (JiangJiaWei1103)
a3ea014 Added script args to agent and task (pryce-turner)
a109bd8 Add asyncssh to dependencies (JiangJiaWei1103)
e5da665 docs: Update setup and demo for a basic use case (JiangJiaWei1103)
0a3d9f1 docs: Update basic arch figure path (JiangJiaWei1103)
1b0f6df docs: Fix typo and hyperlink (JiangJiaWei1103)
26cc201 fix: A tmp workaround to test agent locally without container_image (JiangJiaWei1103)
16d953e feat: Support user-defined batch script content with SlurmShellTask (JiangJiaWei1103)
c743917 feat: Fall back to PythonTask for naive use cases (JiangJiaWei1103)
e365dee refactor: Define Slurm as a base task config and extend for remote sc… (JiangJiaWei1103)
c1064d4 feat: Support PythonFunctionTask and reorganize agent structure (JiangJiaWei1103)
@@ -0,0 +1,5 @@
# Flytekit Slurm Plugin

The Slurm agent is designed to integrate Flyte workflows with Slurm-managed high-performance computing (HPC) clusters, enabling users to leverage Slurm's capabilities for compute resource allocation, scheduling, and monitoring.

This [guide](https://github.com/JiangJiaWei1103/flytekit/blob/slurm-agent-dev/plugins/flytekit-slurm/demo.md) provides a concise overview of the design philosophy behind the Slurm agent and explains how to set up a local environment for testing the agent.
Amazing Graph.

Thanks, bro.
@@ -0,0 +1,113 @@
# Slurm Agent Demo

In this guide, we briefly introduce how to set up an environment to test the Slurm agent locally without running the backend service (e.g., the Flyte agent gRPC server). It covers both basic and advanced use cases.

## Table of Contents
* [Overview](https://github.com/JiangJiaWei1103/flytekit/blob/slurm-agent-dev/plugins/flytekit-slurm/demo.md#overview)
* [Setup a Local Test Environment](https://github.com/JiangJiaWei1103/flytekit/blob/slurm-agent-dev/plugins/flytekit-slurm/demo.md#setup-a-local-test-environment)
    * [Flyte Client (Localhost)](https://github.com/JiangJiaWei1103/flytekit/blob/slurm-agent-dev/plugins/flytekit-slurm/demo.md#flyte-client-localhost)
    * [Remote Tiny Slurm Cluster](https://github.com/JiangJiaWei1103/flytekit/blob/slurm-agent-dev/plugins/flytekit-slurm/demo.md#remote-tiny-slurm-cluster)
    * [SSH Configuration](https://github.com/JiangJiaWei1103/flytekit/blob/slurm-agent-dev/plugins/flytekit-slurm/demo.md#ssh-configuration)
* [Run a Demo](https://github.com/JiangJiaWei1103/flytekit/blob/slurm-agent-dev/plugins/flytekit-slurm/demo.md#run-a-demo)

## Overview
At the highest level, the Slurm agent has three core methods for interacting with a Slurm cluster:
1. `create`: Use `srun` or `sbatch` to run a job on a Slurm cluster
2. `get`: Use `scontrol show job <job_id>` to monitor the Slurm job state
3. `delete`: Use `scancel <job_id>` to cancel the Slurm job (this method is still under test)
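
To make the three methods more concrete, here is a minimal sketch of the text handling they rely on: reading the job id that `sbatch` prints, reading the `JobState` field from `scontrol show job`, and building the `scancel` command. The helper names (`parse_job_id`, `parse_job_state`, `cancel_command`) and the sample output are illustrative assumptions, not part of the plugin; the agent in this PR does the equivalent parsing inline.

```python
import re
from typing import Optional

# Matches the `JobState=<STATE>` field in `scontrol show job` output.
_JOB_STATE = re.compile(r"\bJobState=(\S+)")


def parse_job_id(sbatch_stdout: str) -> str:
    """Return the last token of sbatch output, e.g. 'Submitted batch job 42'."""
    return sbatch_stdout.split()[-1]


def parse_job_state(scontrol_stdout: str) -> Optional[str]:
    """Return the lowercased JobState value, or None if the field is absent."""
    match = _JOB_STATE.search(scontrol_stdout)
    return match.group(1).lower() if match else None


def cancel_command(job_id: str) -> str:
    """Build the command that `delete` would send to the cluster."""
    return f"scancel {job_id}"


# Illustrative sample only; real `scontrol` output contains many more fields.
sample_scontrol = (
    "JobId=42 JobName=tiny-slurm\n"
    "   JobState=COMPLETED Reason=None Dependency=(null)\n"
)

job_id = parse_job_id("Submitted batch job 42\n")
print(job_id, parse_job_state(sample_scontrol), cancel_command(job_id))
```

Returning `None` when `JobState` is missing leaves the caller free to decide on a default, whereas the agent in this PR falls back to `"running"`.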

In its simplest form, the Slurm agent supports directly running a batch script with `sbatch` on a Slurm cluster, as shown below:

![](https://github.com/JiangJiaWei1103/flytekit/blob/slurm-agent-dev/plugins/flytekit-slurm/assets/basic_arch.png)

## Setup a Local Test Environment
Without running the backend service, we can set up an environment to test the Slurm agent locally. The setup consists of two main components: a client (localhost) and a remote tiny Slurm cluster. We then configure an SSH connection, which relies on `asyncssh`, to facilitate communication between the two.

### Flyte Client (Localhost)
1. Set up a local Flyte cluster following this [official guide](https://docs.flyte.org/en/latest/community/contribute/contribute_code.html#how-to-setup-dev-environment-for-flytekit)
2. Build a virtual environment (e.g., conda) and activate it
3. Clone the Flytekit repo, check out the Slurm agent PR, and install Flytekit
```
git clone https://github.com/flyteorg/flytekit.git && cd flytekit
gh pr checkout 3005
make setup && pip install -e .
```
4. Install the Flytekit Slurm agent
```
cd plugins/flytekit-slurm/
pip install -e .
```

### Remote Tiny Slurm Cluster
To simplify the setup process, we follow this [guide](https://github.com/JiangJiaWei1103/Slurm-101) to configure a single-host Slurm cluster, covering `slurmctld` (the central management daemon) and `slurmd` (the compute node daemon).

### SSH Configuration
To facilitate communication between the Flyte client and the remote Slurm cluster, we set up SSH on the Flyte client side as follows:
1. Create a new authentication key pair
```
ssh-keygen -t rsa -b 4096
```
2. Copy the public key to the remote Slurm cluster
```
ssh-copy-id <username>@<remote_server_ip>
```
3. Enable key-based authentication
```
# ~/.ssh/config
Host <host_alias>
  HostName <remote_server_ip>
  Port <ssh_port>
  User <username>
  IdentityFile <path_to_private_key>
```

## Run a Demo
Suppose we have a batch script to run on the Slurm cluster:
```
#!/bin/bash

echo "Working!" >> ./remote_touch.txt
```

We use the following Python script to test the Slurm agent on the client side. A crucial part of the task configuration is specifying the target Slurm cluster and designating the batch script's path within the cluster.

```python
import os

from flytekit import workflow
from flytekitplugins.slurm import Slurm, SlurmTask


echo_job = SlurmTask(
    name="echo-job-name",
    task_config=Slurm(
        slurm_host="<host_alias>",
        batch_script_path="<path_to_batch_script_within_cluster>",
        sbatch_conf={
            "partition": "debug",
            "job-name": "tiny-slurm",
        }
    )
)


@workflow
def wf() -> None:
    echo_job()


if __name__ == "__main__":
    from flytekit.clis.sdk_in_container import pyflyte
    from click.testing import CliRunner

    runner = CliRunner()
    path = os.path.realpath(__file__)

    print(">>> LOCAL EXEC <<<")
    result = runner.invoke(pyflyte.main, ["run", path, "wf"])
    print(result.output)
```
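
For reference, the following sketch shows roughly what command the `sbatch_conf` in the script above turns into. It is a standalone reimplementation that mirrors the `_get_sbatch_cmd` helper in the agent module of this PR; the function name `build_sbatch_cmd` and the script path `/home/demo/echo.sh` are placeholders chosen for illustration.

```python
from typing import Dict, List, Optional


def build_sbatch_cmd(
    sbatch_conf: Dict[str, str],
    batch_script_path: str,
    batch_script_args: Optional[List[str]] = None,
) -> str:
    """Join sbatch options, the script path, and script args into one command."""
    cmd = ["sbatch"]
    # Each config entry becomes a long option followed by its value.
    for opt, val in sbatch_conf.items():
        cmd.extend([f"--{opt}", str(val)])
    cmd.append(batch_script_path)
    cmd.extend(batch_script_args or [])
    return " ".join(cmd)


demo_cmd = build_sbatch_cmd(
    {"partition": "debug", "job-name": "tiny-slurm"},
    "/home/demo/echo.sh",  # hypothetical path on the cluster
)
print(demo_cmd)  # sbatch --partition debug --job-name tiny-slurm /home/demo/echo.sh
```

Because the pieces are joined with plain spaces, option values or arguments that contain spaces would need quoting (for example with `shlex.quote`) before being sent to the remote shell.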

After the Slurm job is completed, we can find the following result on the Slurm cluster:

![](https://github.com/JiangJiaWei1103/flytekit/blob/slurm-agent-dev/plugins/flytekit-slurm/assets/slurm_basic_result.png)
@@ -0,0 +1,2 @@
from .agent import SlurmAgent
from .task import Slurm, SlurmShell, SlurmShellTask, SlurmTask
@@ -0,0 +1,136 @@
import tempfile
from dataclasses import dataclass
from typing import Dict, List, Optional

import asyncssh
from asyncssh import SSHClientConnection

from flytekit.extend.backend.base_agent import AgentRegistry, AsyncAgentBase, Resource, ResourceMeta
from flytekit.extend.backend.utils import convert_to_flyte_phase
from flytekit.models.literals import LiteralMap
from flytekit.models.task import TaskTemplate


@dataclass
class SlurmJobMetadata(ResourceMeta):
    """Slurm job metadata.

    Args:
        job_id: Slurm job id.
    """

    job_id: str
    slurm_host: str


class SlurmAgent(AsyncAgentBase):
    name = "Slurm Agent"

    # SSH connection pool for multi-host environment
    # _ssh_clients: Dict[str, SSHClientConnection]
    _conn: Optional[SSHClientConnection] = None

    # Tmp remote path of the batch script
    REMOTE_PATH = "/tmp/echo_shell.slurm"

    # Dummy script content
    DUMMY_SCRIPT = "#!/bin/bash"

    def __init__(self) -> None:
        super(SlurmAgent, self).__init__(task_type_name="slurm", metadata_type=SlurmJobMetadata)

    async def create(
        self,
        task_template: TaskTemplate,
        inputs: Optional[LiteralMap] = None,
        **kwargs,
    ) -> SlurmJobMetadata:
        # Retrieve task config
        slurm_host = task_template.custom["slurm_host"]
        batch_script_args = task_template.custom["batch_script_args"]
        sbatch_conf = task_template.custom["sbatch_conf"]

        # Construct sbatch command for Slurm cluster
        upload_script = False
        if "script" in task_template.custom:
            script = task_template.custom["script"]
            assert script != self.DUMMY_SCRIPT, "Please write the user-defined batch script content."

            batch_script_path = self.REMOTE_PATH
            upload_script = True
        else:
            # Assume the batch script is already on the Slurm cluster
            batch_script_path = task_template.custom["batch_script_path"]
        cmd = _get_sbatch_cmd(
            sbatch_conf=sbatch_conf, batch_script_path=batch_script_path, batch_script_args=batch_script_args
        )

        # Run Slurm job
        if self._conn is None:
            await self._connect(slurm_host)
        if upload_script:
            with tempfile.NamedTemporaryFile("w") as f:
                f.write(script)
                f.flush()
                async with self._conn.start_sftp_client() as sftp:
                    await sftp.put(f.name, self.REMOTE_PATH)
        res = await self._conn.run(cmd, check=True)

        # Retrieve Slurm job id
        job_id = res.stdout.split()[-1]

        return SlurmJobMetadata(job_id=job_id, slurm_host=slurm_host)

    async def get(self, resource_meta: SlurmJobMetadata, **kwargs) -> Resource:
        await self._connect(resource_meta.slurm_host)
        res = await self._conn.run(f"scontrol show job {resource_meta.job_id}", check=True)

        # Determine the current flyte phase from Slurm job state
        job_state = "running"
        for o in res.stdout.split(" "):
            if "JobState" in o:
                job_state = o.split("=")[1].strip().lower()
        cur_phase = convert_to_flyte_phase(job_state)

        return Resource(phase=cur_phase)

    async def delete(self, resource_meta: SlurmJobMetadata, **kwargs) -> None:
        await self._connect(resource_meta.slurm_host)
        _ = await self._conn.run(f"scancel {resource_meta.job_id}", check=True)

    async def _connect(self, slurm_host: str) -> None:
        """Make an SSH client connection."""
        self._conn = await asyncssh.connect(host=slurm_host)


def _get_sbatch_cmd(
    sbatch_conf: Dict[str, str], batch_script_path: str, batch_script_args: Optional[List[str]] = None
) -> str:
    """Construct Slurm sbatch command.

    We assume all main scripts and dependencies are on Slurm cluster.

    Args:
        sbatch_conf: Options of sbatch command.
        batch_script_path: Absolute path of the batch script on Slurm cluster.
        batch_script_args: Additional args for the batch script on Slurm cluster.

    Returns:
        cmd: Slurm sbatch command.
    """
    # Setup sbatch options
    cmd = ["sbatch"]
    for opt, val in sbatch_conf.items():
        cmd.extend([f"--{opt}", str(val)])

    # Assign the batch script to run
    cmd.append(batch_script_path)

    # Add args if present
    if batch_script_args:
        for arg in batch_script_args:
            cmd.append(arg)

    cmd = " ".join(cmd)
    return cmd


AgentRegistry.register(SlurmAgent())
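
The commented-out `_ssh_clients` attribute above hints at a per-host connection pool. As a rough sketch of that idea only (the `ConnectionPool` class and `fake_connect` stand-in below are hypothetical and not part of this PR), one connection per host could be cached so that `create`, `get`, and `delete` reuse it instead of reconnecting. In real use the `connect` callable would presumably wrap `asyncssh.connect(host=...)`; handling of dropped connections is left out.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


class ConnectionPool:
    """Cache one connection per host (hypothetical helper, not part of this PR)."""

    def __init__(self, connect: Callable[[str], Awaitable[object]]) -> None:
        self._connect = connect
        self._conns: Dict[str, object] = {}
        self._lock = asyncio.Lock()

    async def get(self, host: str) -> object:
        # Serialize lookups so concurrent callers do not open duplicate connections.
        async with self._lock:
            if host not in self._conns:
                self._conns[host] = await self._connect(host)
            return self._conns[host]


async def demo() -> List[str]:
    opened: List[str] = []

    async def fake_connect(host: str) -> object:
        # Stand-in for a real SSH connect call; records each new connection.
        await asyncio.sleep(0)
        opened.append(host)
        return object()

    # Build the pool inside the running event loop.
    pool = ConnectionPool(fake_connect)
    await asyncio.gather(pool.get("hostA"), pool.get("hostA"), pool.get("hostB"))
    return opened


opened_hosts = asyncio.run(demo())
print(opened_hosts)  # ['hostA', 'hostB']
```

Keying the cache by host also avoids the situation where a connection opened for one `slurm_host` is silently reused for a task that targets a different one.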
Why do we need this? Is this for the shell task?

If we define a `SlurmTask` without specifying `container_image` (as in the example Python script provided above), `ctx.serialization_settings` will be `None`. Then, an error is raised stating that `PythonAutoContainerTask` needs an image.

I think this is just a temporary workaround for local testing, and I'm still pondering how to better handle this issue.