Make torch TP composable with torch.compile #2352

Merged: 1 commit merged into sgl-project:main on Dec 5, 2024

Conversation

@kwen2501 (Contributor) commented on Dec 4, 2024

Motivation

Previously, we had this block of code in RowwiseParallel:

# wait for the output to be ready
if isinstance(outputs, AsyncCollectiveTensor):
  return outputs.wait()
else:
  return outputs

When Dynamo traces this code, outputs is an AsyncCollectiveTensor, so the outputs.wait() call gets burned into the compiled graph. At run time, however, outputs is somehow a normal tensor, so we hit the following error:

AttributeError: 'FunctionalTensor' object has no attribute 'wait'

This is a bug in Dynamo.
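
For intuition, here is a minimal, non-distributed sketch of the failure mode (illustrative only; traced_epilogue is a stand-in, not the actual Dynamo-generated code):

import torch

def traced_epilogue(outputs):
    # Illustrative stand-in for what the compiled graph effectively does
    # after tracing with an AsyncCollectiveTensor: the isinstance check is
    # gone and only the "async" branch survives.
    return outputs.wait()

plain = torch.ones(4)
try:
    traced_epilogue(plain)
except AttributeError as e:
    # Mirrors the error above: a plain tensor has no .wait() method.
    print(e)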

Modifications

To work around the Dynamo bug above, we replace the if-else block with a single call that has deterministic behavior:

torch.distributed._functional_collectives.wait_tensor(outputs)

wait_tensor accepts both a regular tensor and an AsyncCollectiveTensor; how it handles each is an internal implementation detail. This works in both eager and compiled modes.
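
For concreteness, a minimal sketch of the reworked output handling (the helper name _prepare_output is hypothetical; only the unconditional wait_tensor call reflects this PR's change):

import torch
from torch.distributed._functional_collectives import wait_tensor

def _prepare_output(outputs: torch.Tensor) -> torch.Tensor:
    # No type check needed: wait_tensor is a no-op for regular tensors and
    # waits on AsyncCollectiveTensor results, so the same call is valid in
    # both eager and compiled modes.
    return wait_tensor(outputs)

Because the same call is traced and executed regardless of the output's runtime type, the compiled graph no longer depends on which tensor subclass Dynamo happened to see at trace time.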

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

cc @jerryzh168 @bdhirsh

@bdhirsh commented on Dec 4, 2024

Here's a minimal repro:

diff --git a/test/distributed/_tensor/test_dtensor_compile.py b/test/distributed/_tensor/test_dtensor_compile.py
index 91fbc396f8e..09a2bf8f183 100644
--- a/test/distributed/_tensor/test_dtensor_compile.py
+++ b/test/distributed/_tensor/test_dtensor_compile.py
@@ -544,12 +544,18 @@ class TestDTensorCompile(torch._dynamo.test_case.TestCase):

     def test_dynamo_dtensor_from_local_redistribute(self):
         mesh = DeviceMesh(self.device_type, torch.arange(self.world_size))
+        from torch.distributed._functional_collectives import AsyncCollectiveTensor

         # pass in tensor as inputs/outputs, create DTensor and run redistribute
         # (allgather collective) inside the fn
         def fn(x):
             dt = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)
-            return dt.redistribute(mesh, [Replicate()]).to_local() + 2
+            out = dt.redistribute(mesh, [Replicate()], async_op=True).to_local()
+            return out
+            if isinstance(out, AsyncCollectiveTensor):
+                return out.wait()
+            else:
+                return out

         x = torch.ones(1)
         ref = fn(x)

@bdhirsh commented on Dec 4, 2024

Fixed in core with pytorch/pytorch#142075

@merrymercy merged commit d693ec0 into sgl-project:main on Dec 5, 2024
15 checks passed