# Providing Topology Information to Training Apps/Ranks

**Authors:**
* @kurman
## **Summary**

Provide topology information to each rank/trainer process that can be used by trainers to optimize, and potentially automate, PyTorch distributed models.
## **Motivation**

As job sizes increase, ML jobs use parallelism to a) increase throughput and b) allow model sizes that do not fit on a single GPU, or even a single node. These techniques are known as Data Parallelism (DP), Model Parallelism (MP), and Pipeline Parallelism (PP). Each has specific trade-offs, and they can be combined.

Using these techniques requires some knowledge of the underlying topology to make the right trade-offs based on communication overhead (latency, bandwidth), available GPU memory, and use of non-accelerator resources (CPU/RAM). PyTorch has introduced generic distributed modeling solutions such as DTensor and DeviceMesh. However, PyTorch distributed primitives provide no automation today for taking advantage of topology information to perform model partitioning; it is the user's job to configure 2-D or 3-D (MP+DP+PP) parallelism in a topology-aware manner for optimal performance.
## **Proposed Implementation**

Provide topology information to each rank/trainer process that can be used to create optimized computational graphs for Torch modules.

Current HPC/gang-scheduled jobs rely on a number of variables passed to each trainer process:
- `WORLD_SIZE` - total number of workers
- `RANK` - unique rank of a worker (0…WORLD_SIZE-1)
- `LOCAL_RANK` - unique rank of a worker on a node, typically used to exclusively assign accelerators on the host

The newly proposed `WORLD_TOPOLOGY_FILE` environment variable will reference a local filesystem file that provides information about the underlying topology, as illustrated in the sketch below.
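
To make the contract concrete, here is a minimal, illustrative sketch (not part of the proposal itself) of how a trainer process might consume the existing variables together with the proposed one. It assumes the JSON layout defined later in this document and degrades gracefully when the new variable is absent:

```python
import json
import os

# Existing gang-scheduling variables set by the launcher (e.g. torchrun).
world_size = int(os.environ["WORLD_SIZE"])
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])

# Proposed: optional pointer to the topology file. Launchers that do not
# implement this RFC will simply not set it, so fall back to None.
topology_path = os.environ.get("WORLD_TOPOLOGY_FILE")
topology = None
if topology_path is not None:
    with open(topology_path) as f:
        topology = json.load(f)

# Peers of this rank, keyed by peer rank (see the format section below).
peers = topology["ranks"][str(rank)]["peers"] if topology else {}
```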
### **Content of the topology information file**

Topology information is a representation of a graph in which nodes are the devices assigned to a RANK and edges are the connectivity type, with optional:
- Latency
- Bandwidth, and
- Available number of channels

The connectivity type assumes HPC-style workloads and can be scoped to the following types:
- NVLink
- NVSwitch
- PCIe
- IB
- NIC/Ethernet
E.g.:

|       | RANK0                    | RANK1                    | RANK2                  |
|-------|--------------------------|--------------------------|------------------------|
| RANK0 | X                        | NVLink [22us, 64GB/s, 4] | IB [600us, 0.4GB/s, 4] |
| RANK1 | NVLink [22us, 64GB/s, 4] | X                        | IB [600us, 0.4GB/s, 4] |
| RANK2 | IB [600us, 0.4GB/s, 4]   | IB [600us, 0.4GB/s, 4]   | X                      |
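
Purely as an illustration of this graph model (the concrete on-disk format is defined in the next section), the adjacency information above could be held in memory along the following lines; the class and field names here are hypothetical, not part of the proposal:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Link:
    """Edge between two ranks: connectivity type plus optional metrics."""
    kind: str                               # e.g. "NVLink", "NVSwitch", "PCIe", "IB", "NIC"
    latency_us: Optional[float] = None      # latency
    bandwidth_gbps: Optional[float] = None  # bandwidth
    channels: Optional[int] = None          # available number of channels

# Adjacency map: topology[src_rank][dst_rank] -> Link
Topology = Dict[int, Dict[int, Link]]

# The 3-rank example above, transcribed into this structure.
example: Topology = {
    0: {1: Link("NVLink", 22, 64, 4), 2: Link("IB", 600, 0.4, 4)},
    1: {0: Link("NVLink", 22, 64, 4), 2: Link("IB", 600, 0.4, 4)},
    2: {0: Link("IB", 600, 0.4, 4), 1: Link("IB", 600, 0.4, 4)},
}
```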
Out of scope for now:
- System resources allocated to each rank (e.g. MEM/CPU/GPU):
  - These tend to be homogeneous
  - Most of them can easily be detected at runtime by the trainer code
- More fine-grained details based on communication patterns (p2p vs. collectives)
### **Format of the topology information file**
- File format: JSON
- Define version = 0.1
- Extensible:
  - Avoid using lists and use named properties; this allows adding new properties later.
E.g.:

```json
{
  "version": "0.1",
  "ranks": {
    "0": {
      "peers": {
        "1": {
          "connection": {
            "latency": {
              "value": "22",
              "measurement": "us"
            },
            "bandwidth": {
              "value": "64",
              "measurement": "GB/s"
            },
            "channels": {
              "value": "4"
            }
          }
        }
      }
    },
    "1": {"peers": {...}}
  }
}
```
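
A short sketch of how a consumer could validate and query a file in this format follows; the helper name and the strict version check are illustrative choices, not part of the specification:

```python
import json

SUPPORTED_VERSION = "0.1"

def peer_connection(topology_path: str, rank: int, peer: int) -> dict:
    """Return the raw 'connection' dict between `rank` and `peer`, or {} if unknown."""
    with open(topology_path) as f:
        topo = json.load(f)
    if topo.get("version") != SUPPORTED_VERSION:
        raise ValueError(f"unsupported topology file version: {topo.get('version')!r}")
    peers = topo.get("ranks", {}).get(str(rank), {}).get("peers", {})
    return peers.get(str(peer), {}).get("connection", {})

# Example (hypothetical path): bandwidth between rank 0 and rank 1.
# peer_connection("/tmp/topology.json", 0, 1)["bandwidth"]["value"]  -> "64"
```

Because properties are named rather than positional, a newer file that adds fields (for example a connectivity-type property) keeps working with this reader, which is the extensibility goal stated above.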
### **Implementation details**

[torchrun/torchelastic](https://pytorch.org/docs/stable/elastic/run.html) will:
- Build the topology representation during the rendezvous process
- Save the file to the local filesystem
- Launch the job with the `WORLD_TOPOLOGY_FILE` env variable set to the location of the file

The topology representation can either reuse or reimplement part of the topology discovery used in [NCCL](https://github.com/NVIDIA/nccl/blob/master/src/graph/topo.cc), which gathers its information from /sys.
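
A rough sketch of the launcher side, under the assumption that the per-rank link data has already been assembled during rendezvous (the `links_by_rank` input and the function name are placeholders, not actual torchelastic APIs):

```python
import json
import os
import subprocess
import tempfile

def launch_worker_with_topology(worker_cmd, links_by_rank):
    """Write the topology file and start one worker with WORLD_TOPOLOGY_FILE set.

    `links_by_rank` is assumed to be the ranks -> peers -> connection mapping
    gathered during rendezvous (e.g. via NCCL-style /sys discovery).
    """
    topo = {"version": "0.1", "ranks": links_by_rank}
    fd, path = tempfile.mkstemp(prefix="topology_", suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump(topo, f)

    env = dict(os.environ, WORLD_TOPOLOGY_FILE=path)
    return subprocess.Popen(worker_cmd, env=env)
```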
## **Potential use cases**

- Locality-aware DeviceMesh definitions, to be used by DTensor, based on the topology information
- Use of this information to place MP on NVLink-connected accelerators and DDP/PP on IB-connected hosts (see the sketch below)
- DistCompiler can be used to construct an optimal execution graph
- Replication of checkpoints/snapshots on non-adjacent nodes to provide more reliable fault tolerance
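
As one concrete illustration of the first two bullets, the sketch below partitions ranks into "islands" connected by high-bandwidth (NVLink-class) links, which could then form the model-parallel dimension of a 2-D DeviceMesh while data parallelism spans islands. The bandwidth threshold and helper name are invented for illustration, since the example format above does not carry an explicit link-type field:

```python
from typing import Dict, List

def high_bandwidth_groups(topo: dict, min_gbps: float = 50.0) -> List[List[int]]:
    """Partition ranks into groups whose pairwise links exceed `min_gbps`,
    using a small union-find over the peer graph from the topology file."""
    ranks = sorted(int(r) for r in topo["ranks"])
    parent = {r: r for r in ranks}

    def find(r: int) -> int:
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    for r in ranks:
        for p, info in topo["ranks"][str(r)].get("peers", {}).items():
            bw = info.get("connection", {}).get("bandwidth")
            if bw and float(bw["value"]) >= min_gbps:
                parent[find(int(p))] = find(r)

    groups: Dict[int, List[int]] = {}
    for r in ranks:
        groups.setdefault(find(r), []).append(r)
    return list(groups.values())

# e.g. groups == [[0, 1], [2, 3]] could be used as the mesh for a DeviceMesh,
# with MP inside each sublist and DP/PP across sublists.
```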
#### Additional Context

- https://arxiv.org/abs/1903.04611
- https://github.com/pytorch/pytorch/issues/88838
- https://pytorch.org/docs/stable/distributed.elastic.html