[Core][1/N] Support PP PyNCCL Groups #4988
Conversation
Signed-off-by: Muralidhar Andoorveedu <[email protected]>
@youkaichao @simon-mo Can you take a quick look at this PR?
tests/distributed/test_pynccl.py (outdated)

@@ -151,6 +152,68 @@ def test_pynccl_with_cudagraph():
    distributed_run(worker_fn_with_cudagraph, 2)


@worker_fn_wrapper
def pp_worker_fn():
Name it as a send_recv worker? It is not testing PP, though.
Ok, will change. This follows the convention from here:
You can also change that into an allreduce worker.
Done
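For orientation, the worker under discussion exercises a simple ring pattern: each rank sends a tensor to the next rank and receives one from the previous rank. The sketch below illustrates that pattern with plain torch.distributed rather than the PR's PyNcclCommunicator, so the function name and details are illustrative only, not the PR's code.

import torch
import torch.distributed as dist

def ring_send_recv_worker():
    # Assumes the default process group is already initialized with an
    # NCCL backend and that each rank owns one GPU.
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device(f"cuda:{rank}")

    # Each rank sends a tensor stamped with its own rank to the next rank
    # and receives one from the previous rank (wrapping around the ring).
    send_tensor = torch.full((16,), float(rank), device=device)
    recv_tensor = torch.empty(16, device=device)
    dst = (rank + 1) % world_size
    src = (rank - 1) % world_size

    # Alternate the send/recv order between even and odd ranks to avoid
    # a point-to-point deadlock on the ring.
    if rank % 2 == 0:
        dist.send(send_tensor, dst)
        dist.recv(recv_tensor, src)
    else:
        dist.recv(recv_tensor, src)
        dist.send(send_tensor, dst)

    # The received tensor should carry the previous rank's value.
    assert torch.all(recv_tensor == float(src))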
# Hardcoded to send to the next rank for now for simplicity.
dst = (self.rank + 1) % self.world_size
why hardcode here?
It's hardcoded because we only need to send to the next and previous rank for now, so it's simple. We can change this if necessary, though.
Please don't hardcode here. Just expose the general form of send/recv. You can add a default arg.
Changed it to expose the general form with a default arg of None, which makes it the next and previous rank respectively.
Let me know if that's sufficient
# Hardcoded to receive from the previous rank for now for simplicity.
src = (self.rank - 1) % self.world_size
same as above
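For reference, a minimal self-contained sketch of the default-argument behaviour described above (None resolves to the next or previous rank); PeerDefaults is an illustrative stand-in, not the PR's PyNcclCommunicator.

from typing import Optional

class PeerDefaults:
    """Illustrative only: resolve optional peer ranks the way the defaults
    above are described (None -> next/previous rank in the group)."""

    def __init__(self, rank: int, world_size: int):
        self.rank = rank
        self.world_size = world_size

    def resolve_dst(self, dst: Optional[int] = None) -> int:
        # Default destination is the next rank, wrapping around the group.
        return (self.rank + 1) % self.world_size if dst is None else dst

    def resolve_src(self, src: Optional[int] = None) -> int:
        # Default source is the previous rank, wrapping around the group.
        return (self.rank - 1) % self.world_size if src is None else src

peers = PeerDefaults(rank=2, world_size=4)
assert peers.resolve_dst() == 3       # next rank by default
assert peers.resolve_src() == 1       # previous rank by default
assert peers.resolve_dst(dst=0) == 0  # an explicit peer overrides the default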
vllm/distributed/parallel_state.py (outdated)
_PP_PYNCCL_COMMUNICATOR = PyNcclCommunicator(
    group=_PP_CPU_GROUP,
    device=_LOCAL_RANK,
)
Creating communicators will add memory cost. Please add a check here: if the PP size == 1, skip the construction of PyNcclCommunicator.
Changed both PP and TP.
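A sketch of the guard being discussed, in the context of the diff above; it assumes the surrounding module already defines PyNcclCommunicator, _PP_CPU_GROUP, and _LOCAL_RANK, and that a helper exposing the PP group size exists (the helper name below is an assumption, not necessarily the exact vLLM symbol).

# Only pay the communicator's memory cost when pipeline parallelism is
# actually in use; otherwise leave the handle as None.
_PP_PYNCCL_COMMUNICATOR = None
if get_pipeline_model_parallel_world_size() > 1:  # assumed helper name
    _PP_PYNCCL_COMMUNICATOR = PyNcclCommunicator(
        group=_PP_CPU_GROUP,
        device=_LOCAL_RANK,
    )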
@youkaichao Do you know how much extra memory the communicators would add? Is it a fixed cost, or does it scale with group size?
We are running into some OOM with CUDA graph enabled that shouldn't be happening. Any pointers would be greatly appreciated!
Signed-off-by: Muralidhar Andoorveedu <[email protected]>
looks good to me overall. thanks! let's see if it passes the tests.
Hey @youkaichao, looks like it passed, can this be merged?
Signed-off-by: Muralidhar Andoorveedu <[email protected]>
This PR adds support for multiple PP PyNCCL groups. It will help #4412 support CUDAGraph.