
Support DP MLA #1970

Merged
merged 16 commits into sgl-project:main on Nov 16, 2024

Conversation

ispobock
Collaborator

@ispobock ispobock commented Nov 9, 2024

Motivation

Support data parallelism on MLA for the DeepSeek model to reduce the KV cache replication across tensor-parallel workers.

python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --trust-remote-code --tp 8 --dp 8 --enable-dp-attention

Modifications

  • Add the --enable-dp-attention option. When it is turned on, DP and TP share the same workers.
  • Add an IDLE forward mode for workers that have no sequences to forward but still need to stay in TP sync with the other workers.
  • Implement the model forward with DP attention + TP MoE (a conceptual sketch follows this list).
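
The following is a minimal conceptual sketch of the DP attention + TP MoE flow and the IDLE forward mode, not the actual SGLang implementation; the function and argument names (forward_layer, attn, moe, local_batch) are hypothetical, and it assumes a torch.distributed setup where the same workers form both the DP attention group and the TP MoE group:

# Conceptual sketch only (hypothetical names, not the SGLang code).
# Assumes torch.distributed is already initialized and dp_size == tp_size,
# i.e. the same workers serve as both DP attention ranks and TP MoE ranks.
import torch
import torch.distributed as dist

def forward_layer(local_batch, attn, moe, hidden_size, device):
    # local_batch: [num_local_tokens, hidden_size] for this DP rank, or an
    # empty tensor when the rank runs an IDLE forward (no sequences assigned).
    if local_batch.numel() > 0:
        # Attention is data parallel: each rank attends only over its own
        # sequences, so the MLA KV cache is not replicated across TP ranks.
        local_hidden = attn(local_batch)
    else:
        # IDLE forward: contribute zero tokens but still join the collectives
        # below so this rank stays in sync with the rest of the TP group.
        local_hidden = torch.empty(0, hidden_size, device=device)

    # Gather every rank's tokens so the tensor-parallel MoE sees the global
    # batch. Token counts differ per rank, hence padding to a common size.
    world = dist.get_world_size()
    count = torch.tensor([local_hidden.shape[0]], device=device)
    counts = [torch.zeros_like(count) for _ in range(world)]
    dist.all_gather(counts, count)
    max_count = max(int(c.item()) for c in counts)
    padded = torch.zeros(max_count, hidden_size, device=device)
    padded[: local_hidden.shape[0]] = local_hidden
    gathered = [torch.empty_like(padded) for _ in range(world)]
    dist.all_gather(gathered, padded)
    global_hidden = torch.cat(
        [g[: int(c.item())] for g, c in zip(gathered, counts)], dim=0
    )

    # MoE stays tensor parallel over the global batch; afterwards each rank
    # slices back out the tokens belonging to its own DP shard.
    moe_out = moe(global_hidden)
    rank = dist.get_rank()
    start = sum(int(c.item()) for c in counts[:rank])
    return moe_out[start : start + local_hidden.shape[0]]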

Performance

Compared to the main branch, this PR improves the prefill throughput by ~20% and the decode throughput by ~67% for the DeepSeek-V2 model on 8×H100.

DP+TP (this PR):

  • prefill: 21658.78 toks/s
  • decode: 11174.62 toks/s

TP (main branch):

  • prefill: 17941.92 toks/s
  • decode: 6656.75 toks/s

Reproduce:

# DP+TP
python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --disable-radix-cache --trust-remote-code --tp 8 --dp 8 --enable-dp-attention
# TP
python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --disable-radix-cache --trust-remote-code --tp 8

# bench prefill
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 512 --random-output 1 --random-range-ratio 1 --num-prompts 10000
# bench decode
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1 --random-output 512 --random-range-ratio 1 --num-prompts 10000

TODO

  • Compatible with cuda graph
  • Compatible with overlap mode

@merrymercy merrymercy self-assigned this Nov 9, 2024
@fengyang95

@ispobock How much performance improvement is expected? Is it mainly in throughput or latency?

@ispobock
Collaborator Author

How much performance improvement is expected? Is it mainly in throughput or latency?

@fengyang95 There is an issue with dp 8. I will test the performance once the issue is fixed. It mainly improves throughput.

@zhyncs zhyncs self-assigned this Nov 12, 2024
@ispobock ispobock changed the title from [WIP] Support DP MLA to Support DP MLA on Nov 12, 2024
@fengyang95

  • Compatible with cuda graph
  • Compatible with overlap mode

@ispobock hi, when will support for cuda graph be planned? It is critical for latency improvement.

Review threads (resolved): python/sglang/srt/server_args.py, python/sglang/srt/models/deepseek_v2.py, python/sglang/srt/managers/scheduler.py
@ispobock
Collaborator Author

when will support for cuda graph be planned?

I will support it soon. The code is almost done and needs some tests.

@fengyang95

when will support for cuda graph be planned?

I will support it soon. The code is almost done and needs some tests.

@ispobock How much additional VRAM would this require approximately?

@ispobock
Collaborator Author

How much additional VRAM would this require approximately?

@fengyang95 For the 236B V2 model, if DP attention is used, the total weights take ~570 GB, so it is preferable to use the FP8 quantized model for better performance.
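
For context, a rough back-of-the-envelope estimate of where the ~570 GB can come from; the replicated-parameter count below is an assumption inferred from the quoted total, not a measured number:

# Back-of-the-envelope weight-memory estimate (assumed numbers, not measured).
total_params = 236e9        # DeepSeek-V2 total parameters
bytes_per_param = 2         # BF16 weights
dp = 8                      # --dp 8 --enable-dp-attention
sharded_gb = total_params * bytes_per_param / 1e9                 # ~472 GB if fully TP-sharded
replicated_params = 7e9     # assumed attention/dense params kept whole on each DP rank
extra_gb = replicated_params * bytes_per_param * (dp - 1) / 1e9   # ~98 GB of replication overhead
print(round(sharded_gb + extra_gb))                               # ~570 GB; FP8 roughly halves this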

Contributor

@merrymercy merrymercy left a comment

Great work!

@merrymercy merrymercy enabled auto-merge (squash) November 16, 2024 08:57
@merrymercy merrymercy merged commit 976bc30 into sgl-project:main Nov 16, 2024
13 checks passed
@merrymercy merrymercy mentioned this pull request Nov 24, 2024
@fengyang95

@ispobock Does this support W4A16? My VRAM is very limited, and even with FP8 the VRAM is not enough.

@ispobock ispobock mentioned this pull request Jan 7, 2025
@ispobock
Collaborator Author

ispobock commented Jan 7, 2025

Does this support W4A16?

@fengyang95 AWQ is supported in #2364.
