Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC-0030 Consolidate TorchElastic and TorchX #53

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

kiukchung
Copy link

@kiukchung kiukchung commented Apr 18, 2023

TorchX was originally created to help PyTorch users in OSS run their PyTorch applications on widely adopted infrastructure setups and schedulers. Today TorchX supports most AI infra setups which use SLURM, Kubernetes, Ray, Batch services (AWS, GCP, and Azure) and Kubernetes-MCAD (IBM). In the recent months we've seen TorchX gain traction as evidenced by several blog posts detailing how to use PyTorch on XYZ using TorchX:

  1. How to run PyTorch on Vertex AI using TorchX
  2. Large-scale distributed training with TorchX and Ray
  3. Scaling distributed training with AWS Trainium and Amazon EKS
  4. Rapidly deploy PyTorch applications on Batch using TorchX

While TorchX launches PyTorch jobs onto local and remote schedulers, TorchElastic (aka torchrun) is responsible for launching PyTorch processes (ranks). But for the user both tools run their training scripts and seemingly overlap in functionality.

This RFC proposes that:

  1. We consolidate TorchElastic and TorchX as a single module
  2. That we do so by:
    1. Upstreaming TorchX as torch.x (under a new submodule called x)
    2. Pull torch.distributed.elastic and put it under torch.x.run

@kiukchung
Copy link
Author


## **Motivation**

1. **Uniform user experience (UX)**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivations would lead me to believe that either of the below are valid options

  1. Moving elastic out of core into torchx
  2. Moving torchx into core with elastic

At a high level I agree that the needs served by torchrun and torchx are quite similar so should be coupled more closely. One worry I have is that typically in core they're strict about adding new dependencies and often prefer out of core registration mechanisms

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, I don't want to bring additional deps + unnecssary bloat to core either. However there are number of benefits for PyTorch to define the specifications of how PyTorch authored scripts/applications are installed and run on the target infrastructure:

  1. PyTorch can intentionally ask for specific features and requirements from the infra to define the runtime environment to its liking (e.g. a flash storage side-car for checkpointing, node-level retries from schedulers for fault-tolerance, specific network topologies for certain workloads)
  2. Helps & encourages the infra/platform providers to support PyTorch in the way PyTorch is natively designed to be run.
    1. and 2) consequentially makes the PyTorch UX more uniform, portable, and easy across different infra providers.


1. **Uniform user experience (UX)**
1. Both `torchrun` and `torchx` **launch** PyTorch scripts
2. For instance, `torchx run -s local dist.ddp -j 1x2 train.py` is basically equivalent to `torchrun --nnodes 1 --nproc_per_node 2 train.py`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes me feel that maybe torchrun is actually the better name, it's more descriptive

consolidate TorchElastic and TorchX *outside* the pytorch/pytorch repo. However, due to the prevalent usage
of `torchrun`, pulling torchelastic out of PyTorch would mean:

1. Makes `torch-2.x` backwards incompatible since users would have to separately install `torchx` to get `torchrun`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel like these drawbacks are particularly bad

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's something to be said about having a built-in runner CLI that helps you launch simple and complex jobs onto some commonly used schedulers (e.g. slurm, k8s). Makes the UX frictionless (especially for distributed).

We discussed the torchvision model where the repo is kept separate but the install instructions on https://pytorch.org by default install both torch and torchvision

image

The fundamental difference though is that not all pytorch users need torchvision, but all pytorch users need to run their scripts at some point.


While (i) is lighter weight compared to (ii) both are rather expensive to run at the pace of commits to PyTorch.
Therefore, we propose that these integ tests are run:
1. Only for commits that touch files in `torch/x/**`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is what inductor does for example and makes sense to me

in such a way that surviving hosts can wait-for a failed host to be replaced then admit the newly replace host
into the worker group (provided by torchelastic).
3. **Out-of-the-box support for job launchers in PyTorch**
1. Without additional installs, PyTorch users can run PyTorch scripts both locally and remotely onto widely used schedulers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually if this is all living in core it might make sense to have all schedulers out core with maybe some exception for the very well maintained or popular ones like K8, that way you can standardize a job launching interface without having to contribute more experimental schedulers in core where they may likely rot

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea I think it makes sense to include the schedulers that are not CSP specific (which would be SLURM and Kubernetes - SLURM doesn't even have a build-time dep since it doesn't have a python client), and keep the CSP ones (e.g. AWS Batch) in torchx so that eventually we can make the relevant service teams own the integration.

#### non-BC Breaking Changes

1. `torch/distributed/launch.py` and `torch/distributed/run.py` would have to remain for a few releases,
but will both point to `torch/x/run.py`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The consolidation makes a lot of sense to me. Do you and your plan plan to support this change all the way till torch/distributed/launch.py and torch/distributed/run.py are eventually removed from the repo?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep that's the plan. Ideally we'd like to keep torchrun still in pytorch (which is what this RFC is advocating for) but torch/distributed/launch.py | run.py would be moved out to torch.x.run


1. Programmatic usages of torchelastic (e.g. `import torch.distributed.elastic`) are **NOT** BC and the user has to codemod
references to `torch.distributed.elastic` to `torch.x.elastic`.
1. **Impact**: Besides Meta-internal use-cases (which can be easily codemoded) the only external programmatic usage

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no problem, we can do the codemod for internal use-cases.

references to `torch.distributed.elastic` to `torch.x.elastic`.
1. **Impact**: Besides Meta-internal use-cases (which can be easily codemoded) the only external programmatic usage
of torchelastic is in DeepSpeed (see [GitHub](https://github.com/microsoft/DeepSpeed/search?q=torch.distributed.elastic))
for which we can work with the project owner to resolve.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we confident this is the only external usage? I do recall seeing many users asking questions about TorchElastic/TorchX on forum.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT programmatic usage + more well known 3rd party lib yes. Most people use torchrun or python -m torch.distributed.launch. Even the kubeflow pytorch op uses python -m torch.distribued.run (https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/imagenet/imagenet.yaml)

> NOTE: The rest of the doc assumes Option 2 as this is the recommended option
3. (Option 3) Upstream everything in TorchX:
1. PROS: Makes maintenance and CI/CD simple since changes to the specs can be tested for downstream BC easily
2. CONS: bloat-torch

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we know, by how much?
What about the option of upstreaming everything in torchx to torch but just including core torchx deps + kuebrenetes + slurm deps (mentioned in Option2) and install the other dependencies only when explicitly installing the variant with extra-requires like we do in torchx torch[batch] or something like that

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right so regardless of which option, the "extra-deps" of torchx would never be added as deps (even extra deps) of torch. We won't mess with the extras_require of torch (aka no torch[batch]) and instead would fail-fast with a warning prompting the user to install additional packages.

2. (OSS-only) `torchx` CLI users would have to switch over to `torchrun`.
1. **Impact**: Every OSS user that currently uses `torchx` CLI would have to one-time opt-in and switch over to torchrun
2. **To make BC**:
1. **(Option 1)** Add console script `torchx` to PyTorch's `setup.py`, which would act as a symlink discussed in the non-BC section for Meta-internal use-case

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the symlink option to keep disruptions for users minimal but we need to migrate at some point

3. (Option 3) Upstream everything in TorchX:
1. PROS: Makes maintenance and CI/CD simple since changes to the specs can be tested for downstream BC easily
2. CONS: bloat-torch
3. Merge the functionalities of `torchx` CLI to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consolidating around torchrun sounds like a good idea. My understanding from the above is that:

  • torchrun would now take a scheduler arg, and users can launch on Slurm/Kubernetes out of the box with just the PyTorch package.
  • Other schedulers will be implemented in the torchx repo, and users would need to install torchx + do some registration to use those schedulers.
  • Runtime libraries like artifact tracking will remain in torchx, as will non-runtime stuff like pre-defined components, the pipelines API, etc.

With this approach what will be the framing for the torchx repo? It may come across as a set of disparate pieces that might be useful during training.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

torchrun would now take a scheduler arg, and users can launch on Slurm/Kubernetes out of the box with just the PyTorch package.

Yes exactly. There's a bit of tricky BC to deal with in terms of CLI arguments (torchrun and torchx CLIs currently have completely different args - torchx focuses more on actions that can be done at the job level whereas torchrun focuses on a "single action": launching multiple procs on the local node).

Other schedulers will be implemented in the torchx repo, and users would need to install torchx + do some registration to use those schedulers.

The idea is to upstream the torchx.specs.* (iface and APIs) to torch.x.* and optionally (but strongly encouraged) upstream 1 non-trivial scheduler integration. IMO SLURM is a good candidate since it has no "client" dependencies (e.g. torchx.schedulers.kubernetes uses the kubernetes python client). The existing schedulers (k8s, aws-batch, gcp-batch, azure-batch, ray) should stay in the torchx repo as an optional plugin.

As for registration of the schedulers, this is already possible in torchx by design (see Registering Custom Schedulers). In fact this is how we currently make only the internal schedulers visible to Meta-internal uses.

Runtime libraries like artifact tracking will remain in torchx, as will non-runtime stuff like pre-defined components, the pipelines API, etc.

Yes for now. torchx.runtime.* (including artifact tracking) APIs are currently not stable therefore, I think should be excluded from the upstream. The pipeline integrations should be deprecated (since no one seems to be using it). As for the pre-def components, I think its useful to have python and ddp (probably should be renamed to spmd) in the main repo since those are the two most used launch topologies. They would serve two purposes:

#. As a vanilla/basic quick start components to launch test and PTD jobs
#. As reference implementations to further customize those components to fit your exact needs/infra.

With this approach what will be the framing for the torchx repo? It may come across as a set of disparate pieces that might be useful during training.

IMHO the TorchX repo should serve as a place to hold commonly used plugins (e.g. widely used scheduler impls, most commonly used resources (host types), etc).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants