From 0e5a6027eaba5ed2b9094287b5b6613333cabc74 Mon Sep 17 00:00:00 2001
From: "Jane (Yuan) Xu" <31798555+janeyx99@users.noreply.github.com>
Date: Tue, 8 Oct 2024 18:10:14 -0400
Subject: [PATCH 1/2] Remove nonexistent flag for acc offloading in
 memory_optimizations.rst

I forgot to remove the mentions of offload_with_streams from the docs when I
removed the flag itself in the last PR.
---
 docs/source/tutorials/memory_optimizations.rst | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)
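
Note for readers of the touched paragraph: the tutorial says offloading is
built on saved_tensors_hooks. The sketch below illustrates that mechanism
only. It is not torchtune's implementation: it copies synchronously and does
not use CUDA streams, and the helper names pack_to_cpu and unpack_from_cpu
are hypothetical. It falls back to CPU when no GPU is present, in which case
the copies are no-ops.

```python
import torch
from torch.autograd.graph import saved_tensors_hooks


def pack_to_cpu(tensor):
    # Runs when autograd saves a tensor for backward: remember the original
    # device and keep a CPU copy instead of the original tensor.
    return tensor.device, tensor.to("cpu")


def unpack_from_cpu(packed):
    # Runs when backward needs the saved tensor: move it back to its device.
    device, cpu_tensor = packed
    return cpu_tensor.to(device)


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(8, 16, device=device, requires_grad=True)

# Tensors saved for backward inside this block go through the hooks above.
with saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    loss = (x * x).sum()

# Backward triggers unpack_from_cpu for each saved tensor.
loss.backward()
```

PyTorch also ships torch.autograd.graph.save_on_cpu, a ready-made context
manager built on the same hooks.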

diff --git a/docs/source/tutorials/memory_optimizations.rst b/docs/source/tutorials/memory_optimizations.rst
index 20ab8918ea..9cd344e388 100644
--- a/docs/source/tutorials/memory_optimizations.rst
+++ b/docs/source/tutorials/memory_optimizations.rst
@@ -98,19 +98,17 @@ See `PyTorch autograd hook tutorial <https://pytorch.org/tutorials/intermediate/
 for more details about how this is implemented through saved_tensors_hooks.
 
 This setting is especially helpful for larger batch sizes, or longer context lengths when you're memory constrained.
-However, these savings in memory can come at the cost of training speed (i.e. tokens per-second), as it takes runtime
-and resources to move Tensors from GPU to CPU and back. The implementation in torchtune has the ``offload_with_streams``
-option to use multiple CUDA streams in order to overlap the extra communication with the computation to hide the extra
-runtime. As the communication workload is variable depending on the number and size of tensors being offloaded, it is
-common to not offload every single activation. In fact, once can use offloading in conjunction with activations
+While of course it takes runtime and resources to move Tensors from GPU to CPU and back, the implementation in
+torchtune uses multiple CUDA streams (when available) in order to overlap the extra communication with the computation
+to hide the extra runtime. As the communication workload is variable depending on the number and size of tensors being
+offloaded, it is common to not offload every single activation. In fact, once can use offloading in conjunction with activations
 checkpointing, where all activations will either be recomputed later in the backward or brought back from the CPU.
 
 *Sounds great! How do I use it?*
 
 To enable activation offloading, use the ``enable_activation_offloading`` config entry or flag
 in our lora finetuning single device recipe, e.g. ``enable_activation_offloading=True``. To allow
-usage of streams, make sure you are on a torch version later than PyTorch 2.5.0.dev20240907 and
-specify ``offload_with_streams=True``.
+usage of streams, make sure you are on a torch version later than PyTorch 2.5.0.dev20240907.
 
 .. _glossary_grad_accm:
 

From 48fef22dd0d91ec1347b499e0e2288d0e34e84dd Mon Sep 17 00:00:00 2001
From: "Jane (Yuan) Xu" <31798555+janeyx99@users.noreply.github.com>
Date: Tue, 8 Oct 2024 18:21:58 -0400
Subject: [PATCH 2/2] Fix "once" -> "one" typo in memory_optimizations.rst

Co-authored-by: ebsmothers <ebs@meta.com>
---
 docs/source/tutorials/memory_optimizations.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/tutorials/memory_optimizations.rst b/docs/source/tutorials/memory_optimizations.rst
index 9cd344e388..be40d89134 100644
--- a/docs/source/tutorials/memory_optimizations.rst
+++ b/docs/source/tutorials/memory_optimizations.rst
@@ -101,7 +101,7 @@ This setting is especially helpful for larger batch sizes, or longer context len
 While of course it takes runtime and resources to move Tensors from GPU to CPU and back, the implementation in
 torchtune uses multiple CUDA streams (when available) in order to overlap the extra communication with the computation
 to hide the extra runtime. As the communication workload is variable depending on the number and size of tensors being
-offloaded, it is common to not offload every single activation. In fact, once can use offloading in conjunction with activations
+offloaded, it is common to not offload every single activation. In fact, one can use offloading in conjunction with activations
 checkpointing, where all activations will either be recomputed later in the backward or brought back from the CPU.
 
 *Sounds great! How do I use it?*