
[Bug] DeepSeekV3 instructions don't work for multi-node H100 setup #2673

Open · 5 tasks done · mycpuorg opened this issue Dec 31, 2024 · 12 comments

@mycpuorg commented Dec 31, 2024

Checklist

1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
5. Please use English; otherwise, the issue will be closed.

Describe the bug

All the steps and issues reported in issue #2658 apply here.
I am using the Docker setup described in the instructions on the page linked below. I believe H100s are the most widely deployed GPUs across major cloud service providers right now (FWIW, I am using AWS SageMaker P5 instances, but that should not matter here).

https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208

Can somebody please help?
Thanks

Reproduction

Follow the instructions at https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208

Environment

A Slurm environment running multiple H100 nodes to serve DeepSeek-V3.
Same result with 2 and 4 nodes (16 and 32 H100 GPUs).

@zhyncs (Member) commented Dec 31, 2024

Please paste your instructions here

@zhyncs (Member) commented Dec 31, 2024

# node 1
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code

# node 2
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code

The startup commands for both nodes need to use the IP of node 1. This command has been verified on multi-node setups on H20 and H800 without any issues.
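
For reference, since the original report runs under Slurm, here is a minimal sbatch sketch of the same two-node launch that derives --node-rank from Slurm's environment instead of hard-coding it. This is an illustration under assumptions, not a verified recipe: the job layout is hypothetical, and it assumes one launcher task per node.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1

# Use the first node in the allocation as the NCCL init address.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR

# srun starts one task per node; SLURM_NODEID supplies each node's rank.
srun bash -c '
  python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 \
    --tp 16 \
    --nccl-init "${MASTER_ADDR}:5000" \
    --nnodes 2 \
    --node-rank "${SLURM_NODEID}" \
    --trust-remote-code
'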

@roG0d (Contributor) commented Jan 2, 2025

Maybe this is helpful #2707

@aisensiy commented Jan 6, 2025

It works for me. I successfully set up two H800 nodes to host DeepSeek-V3.

@LaoZhang-best commented

> The startup commands for both nodes need to use the IP of node 1. This command has been verified on multi-node setups on H20 and H800 without any issues.

Hello @zhyncs, I am using two H20 nodes (2×8). After about 10 minutes of running the test script, the service throws a "watchdog timeout" error (the default is 300 seconds). vLLM does not have this problem, but vLLM is slow.

@zhyncs (Member) commented Jan 6, 2025

@LaoZhang-best I think H20 works well. @Lzhang-hub, could you help take a look?

@Lzhang-hub (Contributor) commented Jan 6, 2025

> Hello @zhyncs, I am using two H20 nodes (2×8). After about 10 minutes of running the test script, the service throws a "watchdog timeout" error (the default is 300 seconds).

I also encountered the same problem ("watchdog timeout"). In my case, it was caused by generating a very long context; I increased the timeout for long contexts.

How many tokens were generated in your case?
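
For what it's worth, recent sglang releases expose a --watchdog-timeout server argument (its 300-second default matches the error above). Whether your installed version has it is an assumption to verify with "python -m sglang.launch_server --help"; if it does, raising the limit for long generations looks like:

# Same node-1 launch as above, with the watchdog limit raised to 30 minutes.
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code --watchdog-timeout 1800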

@LaoZhang-best commented

> I also encountered the same problem ("watchdog timeout"). How many tokens were generated in your case?

I start with Docker; the commands are:

# node 1
docker run -d --gpus all --name deepseek3-multi --restart always --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5 -e NCCL_DEBUG=TRACE --ipc host --network=host -v ~/.cache/huggingface:/root/.cache/huggingface -v /data1/model:/data1/model af.hikvision.com.cn/docker-proxy/lmsysorg/sglang:latest-srt python3 -m sglang.launch_server --model /data1/model/DeepSeek-V3.0 --tp 16 --dist-init-addr 10.113.76.252:20000 --nnodes 2 --node-rank 0 --trust-remote-code --port 8000 --host 0.0.0.0

# node 2
docker run -d --gpus all --name deepseek3-multi --restart always --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5 -e NCCL_DEBUG=TRACE --ipc host --network=host -v ~/.cache/huggingface:/root/.cache/huggingface -v /data1/model:/data1/model af.hikvision.com.cn/docker-proxy/lmsysorg/sglang:latest-srt python3 -m sglang.launch_server --model /data1/model/DeepSeek-V3.0 --tp 16 --dist-init-addr 10.113.76.252:20000 --nnodes 2 --node-rank 1 --trust-remote-code --port 8000 --host 0.0.0.0

Error on node 1:
[screenshot: node 1 error log]

Error on node 2:
[screenshot: node 2 error log]

With 40 concurrent requests of around 1000~2000 tokens each, the token generation speed is around 20 tokens/s. I feel that the generation speed is slow.
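
Once both containers are up, a quick check against the node-1 endpoint can help separate launch failures from load-induced watchdog trips. This sketch assumes the standard sglang HTTP routes (/health and /generate) are present in the image used above:

# Liveness check against the server launched on 10.113.76.252:8000.
curl http://10.113.76.252:8000/health

# One short generation request before applying 40-way concurrency.
curl http://10.113.76.252:8000/generate -H "Content-Type: application/json" -d '{"text": "Hello, my name is", "sampling_params": {"max_new_tokens": 32, "temperature": 0}}'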

@Lzhang-hub (Contributor) commented

@LaoZhang-best Yes, the model we deployed also runs at about 20 tokens/s. I think 40 concurrent requests is too many for a decoding speed of 20 tokens/s, which causes some requests not to complete within 300 seconds. The server then assumes it is hung and crashes to prevent hanging.

Looking forward to further sglang optimization.
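
A rough back-of-the-envelope supports this: at ~20 tokens/s, a 2000-token completion takes ~100 seconds on its own, so a request that queues behind two or three batches' worth of work can easily pass the 300-second watchdog limit. One mitigation, assuming your sglang version exposes the --max-running-requests argument (again, confirm with --help), is to cap in-flight requests so excess work waits at the HTTP layer instead of inflating per-request latency:

# Cap concurrent running requests; queued requests wait instead of slowing decode.
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code --max-running-requests 16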

@LaoZhang-best commented

> Yes, the model we deployed also runs at about 20 tokens/s. I think 40 concurrent requests is too many for a decoding speed of 20 tokens/s.

Looking forward to further optimization : )

@LaoZhang-best commented

@Lzhang-hub Hello bro, I used post5 to start DeepSeek (two H20 nodes, 16 cards) and found that it still triggers the watchdog timeout (300 sec). Although inference has become faster, the problem remains.

@LaoZhang-best commented

> I used post5 to start DeepSeek (two H20 nodes, 16 cards) and found that it still triggers the watchdog timeout (300 sec).

With post4 the server throws a watchdog timeout exception. Although post5 does not throw an exception, it still blocks.
