Add DDP Communication Hooks #2841
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks! Overall this looks great to me, a few requests though:
- Can we add an example in `examples/by_feature` specifically showcasing this usage?
- Can we add some documentation to the official docs on it? Ideally a Usage Guide if nothing else (though a Concept Guide too would be ideal!)
src/accelerate/utils/dataclasses.py
Outdated
```python
- **BATCHED_POWER_SGD** -- DDP communication hook to use batched PowerSGD
"""

# Subclassing str as well as Enum allows the `DistributedType` to be JSON-serializable out of the box.
```
Copy/paste leftover :)
Thanks for fixing this :)
Thank you for the review and suggestions :) In response to your feedback, I have made the following updates:
- Added an example in `examples/by_feature` showcasing DDP communication hooks.
- Added a usage guide on DDP communication hooks to the official docs.

These additions aim to provide clear guidance on how to utilize DDP communication hooks with the 🤗 Accelerate library, enhancing the usability and performance of distributed training. Please let me know if there are any further adjustments or additions required.
Co-authored-by: Zach Mueller <[email protected]>
Thanks for the clean PR @yhna940! LGTM! Just a nit, could you add some details about `comm_wrapper` or `comm_state_option`? From the tests, it looks like this is only useful for POWER_SGD.
Thank you for the feedback and the suggestion @SunMarc. I have added more details about `comm_wrapper` and `comm_state_option`. However, there are other state-using options in PyTorch, such as the post-localSGD hook:

```python
import torch
import torch.distributed as dist
import torch.distributed.algorithms.model_averaging.averagers as averagers
import torch.nn as nn
from torch.distributed.optim import PostLocalSGDOptimizer
from torch.distributed.algorithms.ddp_comm_hooks.post_localSGD_hook import (
    PostLocalSGDState,
    post_localSGD_hook,
)

# `module`, `rank`, `data`, `labels`, and `loss_fn` are placeholders assumed to be
# defined by the surrounding training script.
model = nn.parallel.DistributedDataParallel(
    module, device_ids=[rank], output_device=rank
)

# Register a post-localSGD communication hook.
state = PostLocalSGDState(process_group=None, subgroup=None, start_localSGD_iter=100)
model.register_comm_hook(state, post_localSGD_hook)

# Create a post-localSGD optimizer that wraps a local optimizer.
# Note that `warmup_steps` used in `PostLocalSGDOptimizer` must be the same as
# `start_localSGD_iter` used in `PostLocalSGDState`.
local_optim = torch.optim.SGD(params=model.parameters(), lr=0.01)
opt = PostLocalSGDOptimizer(
    optim=local_optim,
    averager=averagers.PeriodicModelAverager(period=4, warmup_steps=100)
)

# In the first 100 steps, DDP runs global gradient averaging at every step.
# After 100 steps, DDP runs gradient averaging within each subgroup (intra-node by default),
# and the post-localSGD optimizer runs global model averaging every 4 steps after applying the local optimizer.
for step in range(0, 200):
    opt.zero_grad()
    output = model(data)  # forward pass (omitted in the original snippet)
    loss = loss_fn(output, labels)
    loss.backward()
    opt.step()
```

These advanced hooks were not included in this PR because the PyTorch official documentation primarily highlights the PowerSGD, FP16, and BF16 hooks. You can find more information about additional hooks in the PyTorch DDP Communication Hooks documentation and the PyTorch GitHub repository.
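For context, here is a rough sketch of how `comm_state_option` is intended to be passed for PowerSGD through 🤗 Accelerate (a sketch only, assuming the `DistributedDataParallelKwargs` fields added in this PR forward the extra kwargs to `PowerSGDState`):

```python
from accelerate import Accelerator, DistributedDataParallelKwargs
from accelerate.utils import DDPCommunicationHookType

# PowerSGD keeps error-feedback state across iterations, so PowerSGDState arguments
# (e.g. `matrix_approximation_rank`) can be supplied via `comm_state_option`.
# Field names here follow this PR's kwargs handler and are assumptions of this sketch.
ddp_kwargs = DistributedDataParallelKwargs(
    comm_hook=DDPCommunicationHookType.POWER_SGD,
    comm_state_option={"matrix_approximation_rank": 2},
)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
```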
Awesome @yhna940! Thanks for iterating! Could you also add the link to the new usage guide you wrote to the `docs/source/_toctree.yml` file? I think that you can put it in the Training section!
Thanks for the quick review @SunMarc, I've done it :)
Very nice job! 👏
docs/source/_toctree.yml
Outdated
```diff
@@ -58,6 +58,8 @@
     title: Apple M1 GPUs
   - local: usage_guides/ipex
     title: IPEX training with CPU
+  - local: usage_guides/ddp_comm_hook
```
I would consider putting it after or before Fully Sharded Data Parallelism since they're more closely related.
```md
## Converting it to 🤗 Accelerate

Now, let's see how to use the same hooks with the 🤗 Accelerate library.

### Using FP16 Compression Hook with 🤗 Accelerate
```
You could use the `<hfoption>` tags to create separate tabs the user can click on to view the native version vs using it in Accelerate. It'll look like the tabs in this section here. For example, for FP16 compression you can do:

```md
<hfoptions id="fp16">
<hfoption id="PyTorch">

code

</hfoption>
<hfoption id="Accelerate">

code

</hfoption>
</hfoptions>
```
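For illustration, a minimal sketch of what the Accelerate tab could contain for FP16 compression (assuming the `DistributedDataParallelKwargs` fields and `DDPCommunicationHookType` enum introduced in this PR):

```python
import torch
import torch.nn as nn
from accelerate import Accelerator, DistributedDataParallelKwargs
from accelerate.utils import DDPCommunicationHookType

# Select the built-in FP16 compression hook via the kwargs handler; the hook is
# registered on the DDP-wrapped model when running under a multi-process launch
# (e.g. `accelerate launch`).
ddp_kwargs = DistributedDataParallelKwargs(comm_hook=DDPCommunicationHookType.FP16)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

model = nn.Linear(100, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = accelerator.prepare(model, optimizer)
```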
```md
For more advanced usage and additional hooks, refer to the [PyTorch DDP Communication Hooks documentation](https://pytorch.org/docs/stable/ddp_comm_hooks.html).

This demonstrates how to use DDP communication hooks to optimize gradient communication in distributed training with the 🤗 Accelerate library.
```
This sentence would be more impactful in the beginning where you describe what the tutorial will do.
```md
### comm_wrapper

`comm_wrapper` is an option to wrap a communication hook with additional functionality. For example, it can be used to combine FP16 compression with other communication strategies. Currently supported wrappers are `no`, `fp16`, and `bf16`.
```
It would be awesome if you could provide a code example!
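For reference, such an example might look roughly like the following sketch (assuming the `comm_hook` and `comm_wrapper` fields added to `DistributedDataParallelKwargs` in this PR):

```python
from accelerate import Accelerator, DistributedDataParallelKwargs
from accelerate.utils import DDPCommunicationHookType

# Combine PowerSGD compression with an FP16 wrapper so the communicated tensors are
# additionally cast to half precision; field names are assumptions based on this PR.
ddp_kwargs = DistributedDataParallelKwargs(
    comm_hook=DDPCommunicationHookType.POWER_SGD,
    comm_wrapper=DDPCommunicationHookType.FP16,
)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
```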
Co-authored-by: Steven Liu <[email protected]>
Thank you for your detailed review @stevhliu :) I've changed the guide as you suggested, I used the `<hfoption>` tags.
Nicely done @yhna940!
What does this PR do?
This PR adds support for DDP communication hooks to the `accelerate` library. Similar to frameworks like PyTorch Lightning and Detectron, these hooks provide an interface to control how gradients are communicated across workers, overriding the standard allreduce in DistributedDataParallel. This feature enables the use of performance-improving communication hooks when using multiple nodes.

Motivation and Context
DDP communication hooks allow users to customize and optimize gradient communication, potentially improving training performance in distributed settings.
Based on the official PyTorch documentation here, I've implemented three default hooks: PowerSGD, FP16, and BF16. These hooks provide performance improvements in distributed training scenarios.
The implementation for registering these hooks was inspired by the PyTorch Lightning implementation, which can be found here.
Fixes # (issue)
N/A
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case: here
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.