[Feature] DeepSeek V3 optimization #2591

zhyncs · 2024-12-26T08:52:39Z

libratiger · 2024-12-26T10:38:29Z

Very quick response !
I understand that the overlap scheduler is model-independent and is a general optimization that should be supported by default.
At least some special optimizations are needed?

merrymercy · 2024-12-26T11:33:36Z

The overlap scheduler is model-independent but has not been supported when using dp attention. We have a private branch for this and will upstream it soon.

fengyang95 · 2024-12-26T13:27:00Z

Is the memory sufficient for an 8 gpus instance? This model size is too large.

zhyncs · 2024-12-26T15:10:45Z

Is the memory sufficient for an 8 gpus instance? This model size is too large.

671B works on H200 * 8 with FP8 (671 < 141 * 8)

zhyncs · 2024-12-26T16:39:53Z

Hi @fengyang95 You can also consider multi node.

If you do not have GPUs with large enough memory, please try multi-node tensor parallelism (help 1 help 2).

zhyncs · 2024-12-26T18:59:38Z

FYI Due to the tight schedule, SGLang v0.4.1 currently only provides preliminary support for DeepSeek V3. To make it run more cost-efficiently, we need to complete most of the optimizations mentioned above. If you are interested in any of the above optimizations, feel free to join the SGLang Slack for discussions or contribute a PR. We hope to complete these optimizations quickly and appreciate any discussion and contributions.

zhyncs · 2024-12-27T17:47:27Z

Update: SGLang v0.4.1.post1 supports CUDA Graph for DeepSeek V3, please use the latest version.

pip install "sglang[all]==0.4.1.post1" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer

zhyncs · 2024-12-29T16:13:17Z

Update: SGLang v0.4.1.post2 supports FP8 GEMM Tuning for DeepSeek V3, please use the latest version.

pip install "sglang[all]==0.4.1.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer

zhyncs · 2024-12-30T05:44:52Z

ref #2647

CSEEduanyu · 2025-01-06T05:46:53Z

plan to support mtp?

zhyncs · 2025-01-06T18:04:53Z

plan to support mtp?

It's on the roadmap and it's named nextn. We'll support it soon.

lixiaolx · 2025-01-08T02:46:05Z

@zhyncs @Ying1123 @merrymercy ,hello,
As you mentioned above, TP+DP,

TP+DP Attention @Ying1123

I have two questions, could you help me answer them?

1.Can we decouple TP and DP after this implementation? Can we configure the scenario where DP is not equal to TP?

2.Is there a detailed schedule for the mentioned above? Are there any related supporting design documents that can be shared?

Mutinifni · 2025-01-10T21:27:09Z

I had another question regarding DP attention. The sglang blog mentions that DP attention is effective because of the MLA has only 1 KV head, which causes unnecessary duplication of KV caches. DeepSeek-V3 MLA has more KV heads (16 attention, 128 KV), so do we still replicate KV caches if just using something like TP8 or TP16? I understand there might not be sufficient heads if the deployment is large.

+1 for shared design docs, if possible.

pipul · 2025-01-13T07:30:49Z

I had another question regarding DP attention. The sglang blog mentions that DP attention is effective because of the MLA has only 1 KV head, which causes unnecessary duplication of KV caches. DeepSeek-V3 MLA has more KV heads (16 attention, 128 KV), so do we still replicate KV caches if just using something like TP8 or TP16? I understand there might not be sufficient heads if the deployment is large.

+1 for shared design docs, if possible.

@zhyncs @Mutinifni

https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/config.json
"num_key_value_heads": 128,
DeepSeek-V3 has 128 KV heads??

min-xu-et · 2025-01-14T18:56:03Z

Are there any data related to inference time batch size and token imbalance between experts? What's the total throughput like for a 8xH200 node?

CSEEduanyu · 2025-01-19T14:09:38Z

Has there been any progress with the support from NextN?

lambert0312 · 2025-01-22T06:31:27Z

The overlap scheduler with DP attention can not be used on A800 * 4., because always OOM.

MtFitzRoy · 2025-01-22T10:31:25Z

Is there a plan to support TP + SP attention?

The paper says "The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP)"

wuyaoxuehun · 2025-01-26T08:46:52Z

Can you support deepseek R1 Q4KM GGUF file，https://huggingface.co/unsloth/DeepSeek-R1-GGUF

Xu-Chen · 2025-01-27T06:47:55Z

Any process about nextn speculative decoding?

Neo9061 · 2025-02-03T17:36:18Z

Hi @ispobock, wonder if you have any timeline to share nextn (speculative decoding) will be supported? thanks

ispobock · 2025-02-04T05:48:40Z

We have finished spec module refactor and will support nextn in the next 1~2 weeks.

Neo9061 · 2025-02-04T21:44:00Z

We have finished spec module refactor and will support nextn in the next 1~2 weeks.

Thanks! I wonder if your implementation will include any mechanism to generate the acceptance rates of the MTP head?

lambert0312 · 2025-02-05T02:03:32Z

DeepSeek MTP spec decode #12755 is Implement DeepSeek MTP: vllm-project/vllm#12181 to support DeepSeek MTP layers for next n prediction.

yukavio · 2025-02-05T09:22:32Z

Does sglang now support deepseekv3 inference with EP>1? When I added --enable-ep-moe to the command to start the service, I found that the process would hang. I'm not sure if this is a problem caused by my environment or if this feature is not currently supported.

jianglan89 · 2025-02-05T15:04:21Z

Does sglang now support deepseekv3 inference with EP>1? When I added --enable-ep-moe to the command to start the service, I found that the process would hang. I'm not sure if this is a problem caused by my environment or if this feature is not currently supported.

Me too, it seems not support ep-moe in 0.4.2

01lin · 2025-02-07T03:04:56Z

When will mtp(nextn speculative decoding) be supported?

lambert0312 · 2025-02-08T02:51:25Z

This is https://github.com/CentML's implementation of DeepSeek MTP modules that enable speculative decoding for DeepSeek-R1. vllm-project/vllm#12915

ltm920716 · 2025-02-08T09:02:09Z

hi,
I am new to sglang，and I need help about deploying deepseek on two nodes：
https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-two-h2008-nodes-and-docker

I have two H100 nodes，what is the best setting at parameters for the throughout？

thanks

zhyncs · 2025-02-10T14:51:20Z

FYI You can check out the latest progress of DeepSeek V3/R1 nextn here. #3472
I believe the majority of the difficult tasks have been completed recently, and we will be launching this feature in the coming days. Please stay tuned.

iammeizu · 2025-02-11T08:38:16Z

I deploy deepseek-v3 on 8 * h20 cards using sglang latest version ，the model serving run success, but output text was a mess，like
“勞經於In學éToTheText》\n檢 we would“ like,畢竟以\n\n it\n\n馬國in les un, Un\n\n transitive.black!\n\n.\n\n 這in,Definitionive had-t E ,andopo en min,,\n 1R:ik\n\n Écrit \n\n 立方 ** ce ** duue ** m ** duidge \n, . en leé onlayeropo ** 主 etDE ** mitgr -: (了 due-ui ** DE ( ** permission theRE,TaskIN PDE \n\n Client é,dule En Info MIN WEB，ZA}\n\n LeUseKK,…\n\n\n\nBLnéSO estONOtherlate ** Cl缶 r- traditional-- Establishmentrée耶 - ù parLE:* =DRval N queyreatek ** B\n\n ** un est être Un liqu** (Ô le �- ** Mine est-mer Merr** AN2 In être processed $Break ** GA IN**-info,ANNі Gö”

what should i do ? （no error log was found） @zhyncs

iammeizu · 2025-02-11T08:41:35Z

I also tried vllm to run both deepseek-v3 and r1 on h20s，the output text was similar，how to debug this problem？

zhyncs · 2025-02-11T08:44:00Z

I deploy deepseek-v3 on 8 * h20 cards using sglang latest version ，the model serving run success, but output text was a mess，like “勞經於In學éToTheText》\n檢 we would“ like,畢竟以\n\n it\n\n馬國in les un, Un\n\n transitive.black!\n\n.\n\n 這in,Definitionive had-t E ,andopo en min,,\n 1R:ik\n\n Écrit \n\n 立方 ** ce ** duue ** m ** duidge \n, . en leé onlayeropo ** 主 etDE ** mitgr -: (了 due-ui ** DE ( ** permission theRE,TaskIN PDE \n\n Client é,dule En Info MIN WEB，ZA}\n\n LeUseKK,…\n\n\n\nBLnéSO estONOtherlate ** Cl缶 r- traditional-- Establishmentrée耶 - ù parLE:* =DRval N queyreatek ** B\n\n ** un est être Un liqu** (Ô le �- ** Mine est-mer Merr** AN2 In être processed $Break ** GA IN**-info,ANNі Gö”

what should i do ? （no error log was found） @zhyncs

@iammeizu You should raise an issue separately. https://github.com/sgl-project/sglang/issues/new?template=1-bug-report.yml

yinfan98 · 2025-02-11T13:17:48Z

I also tried vllm to run both deepseek-v3 and r1 on h20s，the output text was similar，how to debug this problem？

I run deepseek-v3 with 8 * H20, it works well. @iammeizu

yinfan98 · 2025-02-11T13:54:55Z

I also test gsm8k and mmlu, you can refer this issue #3486.

iammeizu · 2025-02-12T01:51:00Z

thks for reply，I solved this problem by check model file compeletion (some safetensor missing , but sglang or vllm don't report error ) @yinfan98 @zhyncs

groklab · 2025-02-12T16:33:02Z

Hi!

How can I deploy DeepSeek V3 on 4 nodes with 14 H100s?

The setup:

node 1

node 2

node 3

node 4

My commands are:

# node 1
python3 -m sglang.launch_server --model-path /models/dsv3 --tp 14 --dist-init-addr 10.21.99.116:5000 --nnodes 4 --node-rank 0 --trust-remote-code --enable-torch-compile --host 0.0.0.0 --port 30000 --grammar-backend xgrammar

# node 2
python3 -m sglang.launch_server --model-path /models/dsv3 --tp 14 --dist-init-addr 10.21.99.116:5000 --nnodes 4 --node-rank 1 --trust-remote-code --enable-torch-compile --grammar-backend xgrammar

# node 3
python3 -m sglang.launch_server --model-path /models/dsv3 --tp 14 --dist-init-addr 10.21.99.116:5000 --nnodes 4 --node-rank 2 --trust-remote-code --enable-torch-compile --grammar-backend xgrammar

# node 4
python3 -m sglang.launch_server --model-path /models/dsv3 --tp 14 --dist-init-addr 10.21.99.116:5000 --nnodes 4 --node-rank 3 --trust-remote-code --enable-torch-compile --grammar-backend xgrammar

Then got error: AssertionError: tp_size must be divisible by number of nodes

Is there a workaround?

xixingya · 2025-02-13T07:52:18Z

Hi!

How can I deploy DeepSeek V3 on 4 nodes with 14 H100s?

The setup:

node 1

node 2

node 3

node 4

My commands are:

# node 1
python3 -m sglang.launch_server --model-path /models/dsv3 --tp 14 --dist-init-addr 10.21.99.116:5000 --nnodes 4 --node-rank 0 --trust-remote-code --enable-torch-compile --host 0.0.0.0 --port 30000 --grammar-backend xgrammar

# node 2
python3 -m sglang.launch_server --model-path /models/dsv3 --tp 14 --dist-init-addr 10.21.99.116:5000 --nnodes 4 --node-rank 1 --trust-remote-code --enable-torch-compile --grammar-backend xgrammar

# node 3
python3 -m sglang.launch_server --model-path /models/dsv3 --tp 14 --dist-init-addr 10.21.99.116:5000 --nnodes 4 --node-rank 2 --trust-remote-code --enable-torch-compile --grammar-backend xgrammar

# node 4
python3 -m sglang.launch_server --model-path /models/dsv3 --tp 14 --dist-init-addr 10.21.99.116:5000 --nnodes 4 --node-rank 3 --trust-remote-code --enable-torch-compile --grammar-backend xgrammar

Then got error: AssertionError: tp_size must be divisible by number of nodes

Is there a workaround?

Hi!

How can I deploy DeepSeek V3 on 4 nodes with 14 H100s?

The setup:

node 1

node 2

node 3

node 4

My commands are:

# node 1
python3 -m sglang.launch_server --model-path /models/dsv3 --tp 14 --dist-init-addr 10.21.99.116:5000 --nnodes 4 --node-rank 0 --trust-remote-code --enable-torch-compile --host 0.0.0.0 --port 30000 --grammar-backend xgrammar

# node 2
python3 -m sglang.launch_server --model-path /models/dsv3 --tp 14 --dist-init-addr 10.21.99.116:5000 --nnodes 4 --node-rank 1 --trust-remote-code --enable-torch-compile --grammar-backend xgrammar

# node 3
python3 -m sglang.launch_server --model-path /models/dsv3 --tp 14 --dist-init-addr 10.21.99.116:5000 --nnodes 4 --node-rank 2 --trust-remote-code --enable-torch-compile --grammar-backend xgrammar

# node 4
python3 -m sglang.launch_server --model-path /models/dsv3 --tp 14 --dist-init-addr 10.21.99.116:5000 --nnodes 4 --node-rank 3 --trust-remote-code --enable-torch-compile --grammar-backend xgrammar

Then got error: AssertionError: tp_size must be divisible by number of nodes

Is there a workaround?

"intermediate size":18432.
quantization block n=128
18432/14 is not an integer, and it also needs to be satisfied that 18432/tp_size divided by 128 is an integer.

zhyncs · 2025-02-17T13:14:16Z

ref
MTP support: #3582
v0.4.3.post1 release: #3638

SGLang supports MTP (nextn) in the Triton backend, achieving a speed of 77 tokens/s, twice as fast as other OSS LLM engines.

ShivamB25 · 2025-02-17T21:46:03Z

@zhyncs can you please recommed the command for long prompt (lots of queries) since we are working with large documents. we want to run our own.
. system has 8 x mi300x , 2TB ram and dual amd process epyc 9654 (96cores). if that helps.
i am using this as of now

➜  ~ docker run -d \      
  --name sglang \
  --device=/dev/kfd \
  --device=/dev/dri \
  --security-opt seccomp=unconfined \
  --cap-add=SYS_PTRACE \
  --group-add video \
  --privileged \
  --shm-size 256g \
  --ipc=host \
  -p 3000:3000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HSA_NO_SCRATCH_RECLAIM=1 \
  jesselopezmicrosoft/sglang:mi300x \
  python3 -m sglang.launch_server \
    --model unsloth/DeepSeek-R1 \
    --tp 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 3000 \
    --enable-hierarchical-cache \
    --enable-cache-report \
    --chunked-prefill-size 8192 \
    --enable-metrics \
    --show-time-cost \
    --stream-output \
    --quantization fp8 \
    --attention-backend triton \
    --sampling-backend flashinfer \
    --grammar-backend xgrammar```

xutizhou · 2025-02-19T08:34:02Z

@zhyncs Is there any plan for optimizing the FP8 GEMM kernel? Recently, I’ve been working on some optimizations for DeepSeekV3 and achieved promising results. I’d love to contribute to the community if possible. Let me know how I can help!

YangQun1 · 2025-02-19T08:59:27Z

@zhyncs Is there any plan to support ep moe for deepseek-v3/r1? there are related issues: #3371
Ok, I find the feature issue. #2740

zhyncs · 2025-02-19T12:58:31Z

@zhyncs Is there any plan for optimizing the FP8 GEMM kernel? Recently, I’ve been working on some optimizations for DeepSeekV3 and achieved promising results. I’d love to contribute to the community if possible. Let me know how I can help!

@xutizhou sgl-kernel already supports the CUTLASS block wise fp8 https://github.com/sgl-project/sglang/blob/main/sgl-kernel/src/sgl-kernel/csrc/fp8_blockwise_gemm_kernel.cu

Please join the slack channel https://slack.sglang.ai

zhyncs · 2025-02-19T12:58:59Z

@zhyncs Is there any plan to support ep moe for deepseek-v3/r1? there are related issues: #3371 Ok, I find the feature issue. #2740

ref #3602

xutizhou · 2025-02-20T07:30:35Z

@zhyncs Is there any plan for optimizing the FP8 GEMM kernel? Recently, I’ve been working on some optimizations for DeepSeekV3 and achieved promising results. I’d love to contribute to the community if possible. Let me know how I can help!

@xutizhou sgl-kernel already supports the CUTLASS block wise fp8 https://github.com/sgl-project/sglang/blob/main/sgl-kernel/src/sgl-kernel/csrc/fp8_blockwise_gemm_kernel.cu

Please join the slack channel https://slack.sglang.ai

Thank you for the reference provided.

zhyncs added enhancement New feature or request performance quant LLM Quantization labels Dec 26, 2024

zhyncs assigned HaiShaw, merrymercy, ispobock, HandH1998, zhyncs and Ying1123 Dec 26, 2024

zhyncs pinned this issue Dec 26, 2024

zhyncs added the high priority label Dec 26, 2024

mowentian mentioned this issue Jan 2, 2025

How can I play with the speculative decoding which metioned in the paper? deepseek-ai/DeepSeek-V3#14

Closed

roG0d mentioned this issue Jan 7, 2025

Benchmark results for DeepSeek-v3 in 2x8xH200 Cluster #2738

Closed

3 tasks

This comment has been minimized.

Sign in to view

luzengxiangcn mentioned this issue Feb 7, 2025

If dp_size = tp_size is still required for deepseek model? #3359

Closed

[Feature] DeepSeek V3 optimization #2591

[Feature] DeepSeek V3 optimization #2591

Comments

zhyncs commented Dec 26, 2024 • edited Loading

Checklist

Adoption

Usage

Features

Related resources

libratiger commented Dec 26, 2024

merrymercy commented Dec 26, 2024 • edited Loading

fengyang95 commented Dec 26, 2024 • edited Loading

zhyncs commented Dec 26, 2024

zhyncs commented Dec 26, 2024

zhyncs commented Dec 26, 2024

zhyncs commented Dec 27, 2024

zhyncs commented Dec 29, 2024

zhyncs commented Dec 30, 2024

CSEEduanyu commented Jan 6, 2025

zhyncs commented Jan 6, 2025

lixiaolx commented Jan 8, 2025 • edited Loading

Mutinifni commented Jan 10, 2025

pipul commented Jan 13, 2025

min-xu-et commented Jan 14, 2025

CSEEduanyu commented Jan 19, 2025

lambert0312 commented Jan 22, 2025

MtFitzRoy commented Jan 22, 2025

wuyaoxuehun commented Jan 26, 2025

Xu-Chen commented Jan 27, 2025 • edited Loading

Neo9061 commented Feb 3, 2025

ispobock commented Feb 4, 2025

Neo9061 commented Feb 4, 2025

lambert0312 commented Feb 5, 2025

yukavio commented Feb 5, 2025

jianglan89 commented Feb 5, 2025

This comment has been minimized.

01lin commented Feb 7, 2025

lambert0312 commented Feb 8, 2025

ltm920716 commented Feb 8, 2025 • edited Loading

zhyncs commented Feb 10, 2025 • edited Loading

iammeizu commented Feb 11, 2025 • edited Loading

iammeizu commented Feb 11, 2025 • edited Loading

zhyncs commented Feb 11, 2025

yinfan98 commented Feb 11, 2025

yinfan98 commented Feb 11, 2025

iammeizu commented Feb 12, 2025

groklab commented Feb 12, 2025

xixingya commented Feb 13, 2025

zhyncs commented Feb 17, 2025

ShivamB25 commented Feb 17, 2025 • edited Loading

xutizhou commented Feb 19, 2025

YangQun1 commented Feb 19, 2025 • edited Loading

zhyncs commented Feb 19, 2025

zhyncs commented Feb 19, 2025

xutizhou commented Feb 20, 2025

zhyncs commented Dec 26, 2024 •

edited

Loading

merrymercy commented Dec 26, 2024 •

edited

Loading

fengyang95 commented Dec 26, 2024 •

edited

Loading

lixiaolx commented Jan 8, 2025 •

edited

Loading

Xu-Chen commented Jan 27, 2025 •

edited

Loading

ltm920716 commented Feb 8, 2025 •

edited

Loading

zhyncs commented Feb 10, 2025 •

edited

Loading

iammeizu commented Feb 11, 2025 •

edited

Loading

iammeizu commented Feb 11, 2025 •

edited

Loading

ShivamB25 commented Feb 17, 2025 •

edited

Loading

YangQun1 commented Feb 19, 2025 •

edited

Loading