This plugin makes launching torch distributed training configurable in Hydra.
The configuration is as follows:
```yaml
hydra:
  launcher:
    _target_: hydra_plugins.hydra_torchrun_launcher.distributed_launcher.TorchDistributedLauncher
    nnodes: '1:1'
    nproc_per_node: '1'
    rdzv_backend: static
    rdzv_endpoint: ''
    rdzv_id: none
    rdzv_conf: ''
    standalone: false
    max_restarts: 0
    monitor_interval: 5
    start_method: spawn  # start_method=spawn is supported, as required by CUDA
    role: default
    module: false
    no_python: false
    run_path: false
    log_dir: null
    redirects: '0'
    tee: '0'
    node_rank: 0
    master_addr: '127.0.0.1'
    master_port: 29500
    local_addr: null
    training_script: ''
    training_script_args: []
```
Each parameter corresponds exactly to the `torchrun` argument of the same name; refer to the `torchrun` documentation for details.
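For example, rendezvous options can be overridden from the command line just like `torchrun` flags. The invocation below is a hypothetical multi-node sketch; the script name and endpoint are placeholders:

```shell
# Each override maps to the torchrun flag of the same name
# (--nnodes, --nproc_per_node, --rdzv_backend, --rdzv_endpoint).
python3 run_net.py --multirun hydra/launcher=torchrun \
    hydra.launcher.nnodes=2 \
    hydra.launcher.nproc_per_node=8 \
    hydra.launcher.rdzv_backend=c10d \
    hydra.launcher.rdzv_endpoint=master-node:29500
```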
Install the plugin:

```shell
pip3 install git+https://github.com/acherstyx/hydra-torchrun-launcher.git
```

Then launch a multirun with the torchrun launcher:

```shell
python3 run_net.py --multirun hydra/launcher=torchrun hydra.launcher.nproc_per_node=8
```
The behavior of this example should be the same as launching directly with `torchrun`:

```shell
torchrun --nproc_per_node=8 run_net.py
```
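For reference, the launched script is an ordinary Hydra app. The sketch below is illustrative only; the script name, empty config, and backend choice are assumptions, not part of the plugin. As with `torchrun`, each worker receives the usual environment variables (`RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`), so `torch.distributed` can initialize from the environment:

```python
# run_net.py — illustrative sketch of a task script (names are assumed).
# Each spawned worker gets RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT in
# its environment, so init_process_group can use the default env:// method.
import hydra
import torch.distributed as dist
from omegaconf import DictConfig


@hydra.main(version_base=None, config_path=None, config_name=None)
def main(cfg: DictConfig) -> None:
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU nodes
    print(f"rank {dist.get_rank()} of {dist.get_world_size()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```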
This plugin is modified from the hydra-torchrun-launcher plugin in hydra/contrib. Currently, the main differences include:

- Following `loky`, the pickling error described in facebookresearch/hydra#2038 is fixed through the use of `cloudpickle` (see the sketch after this list). As a result, this version of the launcher supports `start_method=spawn`, which is required by CUDA (see pytorch/pytorch#40403).
- The config is adjusted to match `torchrun`.
- Fixed `hydra.runtime.output_dir` being missing after spawn.
- Fixed the return value of multi-node training.
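As a minimal illustration of the `cloudpickle` point (this snippet is illustrative, not the plugin's actual code): with `start_method=spawn` the task function must be serialized to reach the child process, and standard `pickle` serializes functions by reference, failing on the lambdas and closures that wrapped task functions can produce, while `cloudpickle` serializes them by value:

```python
# Illustrative only — not the plugin's code.
import pickle

import cloudpickle

task = lambda x: 2 * x  # stands in for a wrapped task function

try:
    pickle.dumps(task)  # pickle-by-reference fails for lambdas
except pickle.PicklingError as e:
    print(f"pickle failed: {e}")

payload = cloudpickle.dumps(task)  # cloudpickle serializes by value
restored = pickle.loads(payload)   # the payload loads with plain pickle
print(restored(21))                # -> 42
```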