
[Bug] DeepSeekV3 instructions don't work for multi-node H100 setup #2673

Open · 5 tasks done · mycpuorg opened this issue Dec 31, 2024 · 12 comments

@mycpuorg commented Dec 31, 2024

Checklist

1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
5. Please use English; otherwise, the issue will be closed.

Describe the bug

All the steps and issues reported in issue #2658 apply here.
I am using the Docker setup described in the instructions on the page linked below. I believe H100s are the most widely deployed GPUs across major cloud service providers right now (FWIW, I am using AWS SageMaker P5 instances, but that should not matter here).

https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208

Can somebody please help?
Thanks

Reproduction

Follow the instructions at https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208

Environment

A Slurm environment running multiple H100 nodes to serve DeepSeek-V3.
Same result with 2 and 4 nodes (16 and 32 H100 GPUs).

@zhyncs (Member) commented Dec 31, 2024

Please paste your instructions here

@zhyncs (Member) commented Dec 31, 2024

# node 1
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# node 2
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code

The startup commands for both nodes need to use the IP of node 1. This command has been verified on multi-node setups on H20 and H800 without any issues.
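
For reference, since the original report runs under Slurm, here is a minimal sbatch sketch of the same two-node launch that derives --node-rank from Slurm's environment instead of hard-coding it. This is an illustration under assumptions, not a verified recipe: the job layout is hypothetical, and it assumes one launcher task per node.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1

# Use the first node in the allocation as the NCCL init address.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR

# srun starts one task per node; SLURM_NODEID supplies each node's rank.
srun bash -c '
  python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 \
    --tp 16 \
    --nccl-init "${MASTER_ADDR}:5000" \
    --nnodes 2 \
    --node-rank "${SLURM_NODEID}" \
    --trust-remote-code
'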

@roG0d (Contributor) commented Jan 2, 2025

Maybe this is helpful #2707

@aisensiy commented Jan 6, 2025

It works for me. I successfully set up two H800 nodes to host DeepSeek-V3.

@LaoZhang-best commented

> The startup commands for both nodes need to use the IP of node 1. This command has been verified on multi-node setups on H20 and H800 without any issues.

Hello @zhyncs, I am using two H20 nodes (2×8). After about 10 minutes of running the test script, the service throws a "watchdog timeout" error (the default is 300 seconds). vLLM does not have this problem, but vLLM is slow.

@zhyncs (Member) commented Jan 6, 2025

@LaoZhang-best I think H20 works well. @Lzhang-hub, could you help take a look?

@Lzhang-hub (Contributor) commented Jan 6, 2025

> Hello @zhyncs, I am using two H20 nodes (2×8). After about 10 minutes of running the test script, the service throws a "watchdog timeout" error (the default is 300 seconds).

I also encountered the same problem ("watchdog timeout"). In my case, it was caused by generating a very long context; I increased the timeout for long contexts.

How many tokens were generated in your case?
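
For what it's worth, recent sglang releases expose a --watchdog-timeout server argument (its 300-second default matches the error above). Whether your installed version has it is an assumption to verify with "python -m sglang.launch_server --help"; if it does, raising the limit for long generations looks like:

# Same node-1 launch as above, with the watchdog limit raised to 30 minutes.
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code --watchdog-timeout 1800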

@LaoZhang-best commented

> I also encountered the same problem ("watchdog timeout"). How many tokens were generated in your case?

I start with Docker; the commands are:

# node 1
docker run -d --gpus all --name deepseek3-multi --restart always --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5 -e NCCL_DEBUG=TRACE --ipc host --network=host -v ~/.cache/huggingface:/root/.cache/huggingface -v /data1/model:/data1/model af.hikvision.com.cn/docker-proxy/lmsysorg/sglang:latest-srt python3 -m sglang.launch_server --model /data1/model/DeepSeek-V3.0 --tp 16 --dist-init-addr 10.113.76.252:20000 --nnodes 2 --node-rank 0 --trust-remote-code --port 8000 --host 0.0.0.0

# node 2
docker run -d --gpus all --name deepseek3-multi --restart always --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5 -e NCCL_DEBUG=TRACE --ipc host --network=host -v ~/.cache/huggingface:/root/.cache/huggingface -v /data1/model:/data1/model af.hikvision.com.cn/docker-proxy/lmsysorg/sglang:latest-srt python3 -m sglang.launch_server --model /data1/model/DeepSeek-V3.0 --tp 16 --dist-init-addr 10.113.76.252:20000 --nnodes 2 --node-rank 1 --trust-remote-code --port 8000 --host 0.0.0.0

Error on node 1:
[screenshot: node 1 error log]

Error on node 2:
[screenshot: node 2 error log]

With 40 concurrent requests of around 1000~2000 tokens each, the token generation speed is around 20 tokens/s. I feel that the generation speed is slow.
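
Once both containers are up, a quick check against the node-1 endpoint can help separate launch failures from load-induced watchdog trips. This sketch assumes the standard sglang HTTP routes (/health and /generate) are present in the image used above:

# Liveness check against the server launched on 10.113.76.252:8000.
curl http://10.113.76.252:8000/health

# One short generation request before applying 40-way concurrency.
curl http://10.113.76.252:8000/generate -H "Content-Type: application/json" -d '{"text": "Hello, my name is", "sampling_params": {"max_new_tokens": 32, "temperature": 0}}'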

@Lzhang-hub (Contributor) commented

@LaoZhang-best Yes, the model we deployed also runs at about 20 tokens/s. I think 40 concurrent requests is too many for a decoding speed of 20 tokens/s, which causes some requests not to complete within 300 seconds. The server then assumes it is hung and crashes to prevent hanging.

Looking forward to further sglang optimization.
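
A rough back-of-the-envelope supports this: at ~20 tokens/s, a 2000-token completion takes ~100 seconds on its own, so a request that queues behind two or three batches' worth of work can easily pass the 300-second watchdog limit. One mitigation, assuming your sglang version exposes the --max-running-requests argument (again, confirm with --help), is to cap in-flight requests so excess work waits at the HTTP layer instead of inflating per-request latency:

# Cap concurrent running requests; queued requests wait instead of slowing decode.
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code --max-running-requests 16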

@LaoZhang-best commented

> Yes, the model we deployed also runs at about 20 tokens/s. I think 40 concurrent requests is too many for a decoding speed of 20 tokens/s.

Looking forward to further optimization : )

@LaoZhang-best commented

@Lzhang-hub Hello bro, I used post5 to start DeepSeek (two H20 nodes, 16 cards) and found that it still triggers the watchdog timeout (300 sec). Although inference has become faster, the problem remains.

@LaoZhang-best commented

> I used post5 to start DeepSeek (two H20 nodes, 16 cards) and found that it still triggers the watchdog timeout (300 sec).

With post4 the server throws a watchdog timeout exception. Although post5 does not throw an exception, it still blocks.
