From 0e5a6027eaba5ed2b9094287b5b6613333cabc74 Mon Sep 17 00:00:00 2001
From: "Jane (Yuan) Xu" <31798555+janeyx99@users.noreply.github.com>
Date: Tue, 8 Oct 2024 18:10:14 -0400
Subject: [PATCH 1/2] Remove nonexistent flag for acc offloading in memory_optimizations.rst

I forgot to remove mentions when I removed the flag in the last PR
---
 docs/source/tutorials/memory_optimizations.rst | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/docs/source/tutorials/memory_optimizations.rst b/docs/source/tutorials/memory_optimizations.rst
index 20ab8918ea..9cd344e388 100644
--- a/docs/source/tutorials/memory_optimizations.rst
+++ b/docs/source/tutorials/memory_optimizations.rst
@@ -98,19 +98,17 @@ See `PyTorch autograd hook tutorial

Date: Tue, 8 Oct 2024 18:21:58 -0400
Subject: [PATCH 2/2] Update docs/source/tutorials/memory_optimizations.rst

Co-authored-by: ebsmothers
---
 docs/source/tutorials/memory_optimizations.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/tutorials/memory_optimizations.rst b/docs/source/tutorials/memory_optimizations.rst
index 9cd344e388..be40d89134 100644
--- a/docs/source/tutorials/memory_optimizations.rst
+++ b/docs/source/tutorials/memory_optimizations.rst
@@ -101,7 +101,7 @@ This setting is especially helpful for larger batch sizes, or longer context len
 While of course it takes runtime and resources to move Tensors from GPU to CPU and back, the implementation in
 torchtune uses multiple CUDA streams (when available) in order to overlap the extra communication with the computation
 to hide the extra runtime. As the communication workload is variable depending on the number and size of tensors being
-offloaded, it is common to not offload every single activation. In fact, once can use offloading in conjunction with activations
+offloaded, it is common to not offload every single activation. In fact, one can use offloading in conjunction with activations
 checkpointing, where all activations will either be recomputed later in the backward or brought back from the CPU.
 
 *Sounds great! How do I use it?*