[Model] DeepSeek-V3 Enhancements #11539
Comments
If I want to deploy DeepSeek 600B using vLLM on RTX 4090s, are there any restrictions? How many RTX 4090s do I need at minimum? |
Is inference with A100s supported? How about quantization?? |
Deepseek v3 doesn't appear to support pipeline parallelism. I get this error when attempting to deploy to 2 8x H100 nodes:
I'm using |
@july8023 It should work on 4090s. The model generally takes about 600GB of memory, and you want another 100-300GB for KV cache, so feel free to plan around that. |
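The sizing advice above can be turned into a rough GPU count. This is a back-of-the-envelope sketch, not something vLLM computes: it assumes 24 GB of VRAM per RTX 4090, reuses the 600 GB weight and 100-300 GB KV cache figures from the comment, and ignores activation memory, per-GPU overhead, and parallelism layout constraints. The helper name `min_gpus` is hypothetical.

```python
import math


def min_gpus(weights_gb: float, kv_cache_gb: float, gpu_mem_gb: float) -> int:
    """Lower bound on the GPU count needed to hold weights plus KV cache."""
    return math.ceil((weights_gb + kv_cache_gb) / gpu_mem_gb)


RTX_4090_GB = 24  # assumption: nominal VRAM per card, all of it usable

# Figures from the comment: ~600 GB of weights, 100-300 GB of KV cache.
for kv_cache_gb in (100, 300):
    count = min_gpus(600, kv_cache_gb, RTX_4090_GB)
    print(f"KV cache {kv_cache_gb} GB -> at least {count} x RTX 4090")
```

In practice the real number is likely higher, since not all VRAM is usable and the cards have to be split across nodes.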
@simon-mo right, A100s don't support fp8. Would the arg --dtype bfloat16 suffice? If not, I found the bf16 version in Huggingface, any insights on whether that would work? |
The model currently does not support --dtype bfloat16 because it is natively trained in fp8. Can you point me to the bf16 version? |
@simon-mo on HF: https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main , on the official repo they provide a script to cast fp8 to bf16, but of course you can't do it on A100s... my guess is a good soul did it and uploaded it to HF. In the repo, see 6. |
vLLM does support this bf16 model on A100. It looks like the config.json properly removed |
Using v0.6.6 EDIT: Apologies, I was using 0.6.2. Redeploying helm chart with 0.6.6.post1. Will see how it goes. |
Any knowledge of a working example of serving deepseekv3 on A100s with vLLM? I'll try later, but any hints or help is very much appreciated |
Hi everyone,
Here’s the command I used:
Does anyone have suggestions or solutions for resolving this issue? Thanks in advance! |
I've had this problem, too. Is there a solution? |
I was getting this error too; it was resolved by removing CPU offloading... hoping for an explanation. Also, any suggestions for increasing token throughput and context length? Would having InfiniBand (i.e. higher inter-node bandwidth and lower latency) be the main way to increase token throughput? And for context lengths > 40k, how much more VRAM would be required? |
Hi @ishaandatta, could you share which model version you are using? I'm getting errors complaining |
We also ran into very slow token processing speed, around 3 tokens/s, even though we use H100s and IB. Any suggestions? |
I found tp16 to be about 2X faster than pp=2 tp=8 w/ 2 x H100 nodes. Here's my testing: https://llm-tracker.info/DeepSeek-V3-Testing Here's vLLM vs SGLang at concurrency=64 atm: Note, I found that vLLM has some stop token errors for output (that SGLang doesn't have) w/ some of my testing. |
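For anyone reproducing the two layouts compared above, here is a small sketch that builds the corresponding `vllm serve` command lines. Assumptions: the model ID `deepseek-ai/DeepSeek-V3` is used as a placeholder for whichever checkpoint you serve, the helper `vllm_serve_cmd` is hypothetical, and the check that `tp * pp` equals the GPU count simply reflects wanting to use every GPU. Multi-node runs also need a Ray cluster set up beforehand, which is not shown.

```python
import shlex


def vllm_serve_cmd(model: str, tp: int, pp: int, total_gpus: int) -> str:
    """Build a `vllm serve` command for a given TP/PP layout."""
    if tp * pp != total_gpus:
        raise ValueError(f"tp*pp = {tp * pp}, but {total_gpus} GPUs are available")
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp),
        "--pipeline-parallel-size", str(pp),
        "--trust-remote-code",
    ]
    return shlex.join(args)


MODEL = "deepseek-ai/DeepSeek-V3"  # assumption: swap in your own checkpoint

# The two layouts discussed for 2 x 8 H100 nodes (16 GPUs total).
print(vllm_serve_cmd(MODEL, tp=16, pp=1, total_gpus=16))
print(vllm_serve_cmd(MODEL, tp=8, pp=2, total_gpus=16))
```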
Same issue. I used 16 H100 GPUs, set TP=16, deployed using Ray in k8s, and enabled the IB network. I made a simple curl request with 10 input tokens and 242 output tokens. This curl test took 44 seconds. Can anyone help me figure out why? |
Are the perf issues related to the MoE optimization? Is it not included in the current version? |
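To put the curl test above on the same scale as the other numbers in this thread, the end-to-end rate can be computed directly. This is only a rough sketch: the helper name `tokens_per_second` is hypothetical, and the result includes prefill, scheduling, and network time, so it slightly understates the pure decode rate.

```python
def tokens_per_second(output_tokens: int, elapsed_s: float) -> float:
    """End-to-end generation rate for a single request."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return output_tokens / elapsed_s


# Figures from the comment: 242 output tokens in 44 seconds.
rate = tokens_per_second(242, 44)
print(f"{rate:.1f} tok/s")
```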
@shaowei-su I'm using the bf16 version you linked. @lhl thank you for sharing this! I'm currently using tp=4 pp=6 as we're aiming for context lengths > 64k. |
for bs=1 SGLang outputs around 26 tok/s:
You should read the infrastructure section of the DeepSeek Technical Report: they deploy in 320-GPU blocks w/ specialized/separated functions. That being said, there are certainly optimizations that can be made for "regular" inference. On vLLM, when doing throughput optimization, with some tuning I can generate >7000 tok/s on a single H100 node for a Llama 3 70B class model at c=512. DSv3 has about half the activations, and at c=512 SGLang currently tops out at about 1100 tok/s on 2xH100 nodes (vLLM is about half of that). You could imagine that there might be a 5-10X in throughput optimization available based naively on activations/fwd pass. This is before spec decode like EAGLE or Medusa is factored in. |
@simon-mo Is there any way or plan to improve the speed of vllm on deepseek v3? Thanks a lot |
we also see 3 token/s on 16x H20 with TP=8,PP=2 |
When I tested TP=16 on GH200 nodes (FP8 version), I was getting ~7.1 t/s (single batch). Ironically, when I used TP=8 (max_model_len=2048 so it all fit), it was slightly faster, which seemed strange. One of the issues that might be slowing vLLM down is that one of the MoE-specific CUDA kernels is hard-coded for DSv3 to force the use of global memory, which is significantly slower than shared memory. This is due to the limited amount of shared memory available (dependent on the GPU model... for example, the H100 has 227KB of shared memory per block). I don't know how much effect this has for this specific kernel, but it likely has some consequence. Techniques like distributed shared memory (H100+ specific) might be usable, or only keeping the active experts in there... but unfortunately I don't know much about CUDA programming. I spent 2 days trying to implement the "active-expert only" approach, but it only slowed things down to 4.5 t/s... |
Hello. When deploying with vLLM today, which parser should be used to support the tool call feature? |
I use vllm==0.6.6.post1. Does it support this feature? |
This issue tracks follow up enhancements after initial support for the Deepseek V3 model. Please feel free to chime in and contribute!
scoring_func and e_correction_bias