[Bug] DeepSeekV3 instructions don't work for multi-node H100 setup #2673
Comments
Please paste your instructions here.
The startup commands for both nodes need to use the IP of node 1. This command has been verified on multi-node H20 and H800 setups without any issues.
Maybe this is helpful: #2707
It works for me. I set up two H800 nodes to host DeepSeek-V3 successfully.
Hello @zhyncs, I use two H20 nodes (2×8). After 10 minutes of running the test script, the service throws a "watchdog timeout" error (the default is 300 seconds). vLLM doesn't have this problem, but vLLM is slow.
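If the server is hitting the 300-second default, one thing to try is raising the watchdog limit at launch. A minimal sketch, assuming `--watchdog-timeout` is the relevant `sglang.launch_server` flag (verify against your installed version with `--help`):

```shell
# Assumption: sglang.launch_server accepts --watchdog-timeout (seconds).
# Check first: python3 -m sglang.launch_server --help | grep -i watchdog
WATCHDOG_TIMEOUT=1800   # generous limit for slow multi-node first runs

# Append the flag to the existing launch command, e.g.:
echo "python3 -m sglang.launch_server ... --watchdog-timeout ${WATCHDOG_TIMEOUT}"
```

This only delays the watchdog; if generation is genuinely stuck rather than slow, the timeout will still eventually fire.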
@LaoZhang-best I think H20 works well. @Lzhang-hub, could you help take a look?
I also encountered the same problem. How many tokens are generated in your case?
I start with Docker. The commands are:

node1:

```
docker run -d --gpus all --name deepseek3-multi --restart always --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5 -e NCCL_DEBUG=TRACE --ipc host --network=host -v ~/.cache/huggingface:/root/.cache/huggingface -v /data1/model:/data1/model af.hikvision.com.cn/docker-proxy/lmsysorg/sglang:latest-srt python3 -m sglang.launch_server --model /data1/model/DeepSeek-V3.0 --tp 16 --dist-init-addr 10.113.76.252:20000 --nnodes 2 --node-rank 0 --trust-remote-code --port 8000 --host 0.0.0.0
```

node2:

```
docker run -d --gpus all --name deepseek3-multi --restart always --shm-size 32g --privileged -e NCCL_IB_HCA=mlx5 -e NCCL_DEBUG=TRACE --ipc host --network=host -v ~/.cache/huggingface:/root/.cache/huggingface -v /data1/model:/data1/model af.hikvision.com.cn/docker-proxy/lmsysorg/sglang:latest-srt python3 -m sglang.launch_server --model /data1/model/DeepSeek-V3.0 --tp 16 --dist-init-addr 10.113.76.252:20000 --nnodes 2 --node-rank 1 --trust-remote-code --port 8000 --host 0.0.0.0
```

With 40 concurrent requests, each around 1000~2000 tokens, the generation speed is around 20 tokens/s. I feel that the generation speed is slow.
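Once both containers are up, the server should be reachable on node 1 at the `--port` from the commands above. A minimal smoke-test sketch, assuming the OpenAI-compatible `/v1/completions` endpoint is exposed (the host IP and model path are taken from the commands above):

```shell
# Values taken from the launch commands above.
HOST=10.113.76.252
PORT=8000

# Request payload; the model name matches the --model path used at launch.
PAYLOAD='{"model": "/data1/model/DeepSeek-V3.0", "prompt": "Hello", "max_tokens": 16}'
echo "$PAYLOAD"

# Uncomment to actually send the request once the server is up:
# curl -s "http://${HOST}:${PORT}/v1/completions" \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```

If this request hangs rather than returning, that points at the same blocking behavior reported above rather than a client-side issue.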
@LaoZhang-best Looking forward to further sglang optimization.
Looking forward to further optimization :)
@Lzhang-hub Hello, I used post5 to start DeepSeek (two H20 nodes, 16 cards) and found that it still triggers the watchdog timeout (300 s). Although inference has become faster, there is still a problem.
When I use post4, the server throws a watchdog timeout exception. Although post5 does not throw an exception, it still blocks.
Checklist
Describe the bug
All the steps and issues reported in issue #2658 apply here.
I am using the Docker setup described in the instructions on this page. I believe H100s are the most widely deployed GPUs across most cloud service providers right now (FWIW, I am using AWS SageMaker P5 instances, but that should not matter here).
https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208
Can somebody please help?
Thanks
Reproduction
Follow instructions on https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208
Environment
Slurm environment running multiple H100 nodes to serve DeepSeek-V3.
Same result with 2 and 4 nodes (16 and 32 H100 GPUs).
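Before digging into sglang itself, it can help to confirm that inter-node GPU communication works at all. A hedged sketch using NVIDIA's nccl-tests (assumes nccl-tests is built and `mpirun` is available; the second-node IP below is a hypothetical placeholder):

```shell
# Assumption: nccl-tests is built under ./build and mpirun can reach both hosts.
NODE1=10.113.76.252     # node 1 IP, as used in --dist-init-addr above
NODE2=10.113.76.253     # hypothetical placeholder for node 2's IP

# 16 ranks total, 8 GPUs per node; run the all-reduce bandwidth test.
echo "mpirun -H ${NODE1}:8,${NODE2}:8 -np 16 ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1"
```

If this test stalls or reports low bandwidth, the watchdog timeouts above are more likely a network/NCCL issue (e.g. `NCCL_IB_HCA` selection) than an sglang bug.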