Remove nonexistent flag for acc offloading in memory_optimizations.rst #1772

Merged 2 commits on Oct 9, 2024
12 changes: 5 additions & 7 deletions docs/source/tutorials/memory_optimizations.rst
@@ -98,19 +98,17 @@ See `PyTorch autograd hook tutorial <https://pytorch.org/tutorials/intermediate/
 for more details about how this is implemented through saved_tensors_hooks.
 
 This setting is especially helpful for larger batch sizes, or longer context lengths when you're memory constrained.
-However, these savings in memory can come at the cost of training speed (i.e. tokens per-second), as it takes runtime
-and resources to move Tensors from GPU to CPU and back. The implementation in torchtune has the ``offload_with_streams``
-option to use multiple CUDA streams in order to overlap the extra communication with the computation to hide the extra
-runtime. As the communication workload is variable depending on the number and size of tensors being offloaded, it is
-common to not offload every single activation. In fact, once can use offloading in conjunction with activations
+While of course it takes runtime and resources to move Tensors from GPU to CPU and back, the implementation in
+torchtune uses multiple CUDA streams (when available) in order to overlap the extra communication with the computation
+to hide the extra runtime. As the communication workload is variable depending on the number and size of tensors being
+offloaded, it is common to not offload every single activation. In fact, one can use offloading in conjunction with activations
 checkpointing, where all activations will either be recomputed later in the backward or brought back from the CPU.
 
 *Sounds great! How do I use it?*
 
 To enable activation offloading, use the ``enable_activation_offloading`` config entry or flag
 in our lora finetuning single device recipe, e.g. ``enable_activation_offloading=True``. To allow
-usage of streams, make sure you are on a torch version later than PyTorch 2.5.0.dev20240907 and
-specify ``offload_with_streams=True``.
+usage of streams, make sure you are on a torch version later than PyTorch 2.5.0.dev20240907.
 
 .. _glossary_grad_accm:
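
For context on the mechanism the updated paragraph describes, here is a minimal sketch of offloading saved activations to CPU with torch.autograd.graph.saved_tensors_hooks. This is only an illustration, not torchtune's implementation: the function names are made up, and a production version (as the doc notes) would use pinned memory, a separate CUDA stream to overlap copies with compute, and would skip small tensors and parameters.

import torch

def pack_to_cpu(tensor):
    # Called each time autograd saves a tensor for the backward pass:
    # remember the original device and stash the data on the CPU.
    return tensor.device, tensor.cpu()

def unpack_from_cpu(packed):
    # Called during backward when the saved tensor is needed again:
    # copy it back to the device it came from.
    device, cpu_tensor = packed
    return cpu_tensor.to(device)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 1)
).to(device)
x = torch.randn(8, 512, device=device)

with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    loss = model(x).sum()   # activations are moved to CPU as they are saved
loss.backward()             # and brought back here, one tensor at a time

PyTorch also ships torch.autograd.graph.save_on_cpu(pin_memory=True), which packages this pack/unpack pair as a ready-made context manager.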

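As a usage sketch of the flag the doc points to: it can be set in the recipe's YAML config or passed as a command-line override with torchtune's tune CLI. The config name below is illustrative; substitute whichever LoRA single-device config you are actually running.

tune run lora_finetune_single_device --config llama3/8B_lora_single_device enable_activation_offloading=True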