
Support DP MLA #1970

Merged
merged 16 commits into sgl-project:main on Nov 16, 2024

Conversation

ispobock
Collaborator

@ispobock ispobock commented Nov 9, 2024

Motivation

Support data parallelism on MLA for the DeepSeek model to reduce the KV cache replication across tensor-parallel workers.

python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --trust-remote-code --tp 8 --dp 8 --enable-dp-attention

Modifications

  • Add the --enable-dp-attention option. When it is turned on, DP and TP share the same workers.
  • Add an IDLE forward mode for workers that have no sequences to forward but still need to stay in TP sync with the other workers.
  • Implement the model forward with DP attention + TP MoE (a conceptual sketch follows this list).
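
The following is a minimal conceptual sketch of the DP attention + TP MoE flow and the IDLE forward mode, not the actual SGLang implementation; the function and argument names (forward_layer, attn, moe, local_batch) are hypothetical, and it assumes a torch.distributed setup where the same workers form both the DP attention group and the TP MoE group:

# Conceptual sketch only (hypothetical names, not the SGLang code).
# Assumes torch.distributed is already initialized and dp_size == tp_size,
# i.e. the same workers serve as both DP attention ranks and TP MoE ranks.
import torch
import torch.distributed as dist

def forward_layer(local_batch, attn, moe, hidden_size, device):
    # local_batch: [num_local_tokens, hidden_size] for this DP rank, or an
    # empty tensor when the rank runs an IDLE forward (no sequences assigned).
    if local_batch.numel() > 0:
        # Attention is data parallel: each rank attends only over its own
        # sequences, so the MLA KV cache is not replicated across TP ranks.
        local_hidden = attn(local_batch)
    else:
        # IDLE forward: contribute zero tokens but still join the collectives
        # below so this rank stays in sync with the rest of the TP group.
        local_hidden = torch.empty(0, hidden_size, device=device)

    # Gather every rank's tokens so the tensor-parallel MoE sees the global
    # batch. Token counts differ per rank, hence padding to a common size.
    world = dist.get_world_size()
    count = torch.tensor([local_hidden.shape[0]], device=device)
    counts = [torch.zeros_like(count) for _ in range(world)]
    dist.all_gather(counts, count)
    max_count = max(int(c.item()) for c in counts)
    padded = torch.zeros(max_count, hidden_size, device=device)
    padded[: local_hidden.shape[0]] = local_hidden
    gathered = [torch.empty_like(padded) for _ in range(world)]
    dist.all_gather(gathered, padded)
    global_hidden = torch.cat(
        [g[: int(c.item())] for g, c in zip(gathered, counts)], dim=0
    )

    # MoE stays tensor parallel over the global batch; afterwards each rank
    # slices back out the tokens belonging to its own DP shard.
    moe_out = moe(global_hidden)
    rank = dist.get_rank()
    start = sum(int(c.item()) for c in counts[:rank])
    return moe_out[start : start + local_hidden.shape[0]]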

Performance

Compared to the main branch, this PR improves the prefill throughput by ~20% and the decode throughput by ~67% for the DeepSeek-V2 model on 8×H100.

DP+TP (this PR):

  • prefill: 21658.78 toks/s
  • decode: 11174.62 toks/s

TP (main branch):

  • prefill: 17941.92 toks/s
  • decode: 6656.75 toks/s

Reproduce:

# DP+TP
python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --disable-radix-cache --trust-remote-code --tp 8 --dp 8 --enable-dp-attention
# TP
python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --disable-radix-cache --trust-remote-code --tp 8

# bench prefill
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 512 --random-output 1 --random-range-ratio 1 --num-prompts 10000
# bench decode
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1 --random-output 512 --random-range-ratio 1 --num-prompts 10000

TODO

  • Compatible with cuda graph
  • Compatible with overlap mode

@merrymercy merrymercy self-assigned this Nov 9, 2024
@fengyang95

@ispobock How much performance improvement is expected? Is it mainly in throughput or latency?

@ispobock
Collaborator Author

How much performance improvement is expected? Is it mainly in throughput or latency?

@fengyang95 There is an issue with dp 8. I will test the performance once the issue is fixed. It mainly improves throughput.

@zhyncs zhyncs self-assigned this Nov 12, 2024
@ispobock ispobock changed the title from [WIP] Support DP MLA to Support DP MLA on Nov 12, 2024
@fengyang95

  • Compatible with cuda graph
  • Compatible with overlap mode

@ispobock hi, when will support for cuda graph be planned? It is critical for latency improvement.

Review threads (resolved): python/sglang/srt/server_args.py, python/sglang/srt/models/deepseek_v2.py, python/sglang/srt/managers/scheduler.py
@ispobock
Collaborator Author

when will support for cuda graph be planned?

I will support it soon. The code is almost done and needs some tests.

@fengyang95

when will support for cuda graph be planned?

I will support it soon. The code is almost done and needs some tests.

@ispobock How much additional VRAM would this require approximately?

@ispobock
Collaborator Author

How much additional VRAM would this require approximately?

@fengyang95 For the 236B V2 model, if DP attention is used, the total weights take ~570 GB, so it is preferable to use the FP8 quantized model for better performance.
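
For context, a rough back-of-the-envelope estimate of where the ~570 GB can come from; the replicated-parameter count below is an assumption inferred from the quoted total, not a measured number:

# Back-of-the-envelope weight-memory estimate (assumed numbers, not measured).
total_params = 236e9        # DeepSeek-V2 total parameters
bytes_per_param = 2         # BF16 weights
dp = 8                      # --dp 8 --enable-dp-attention
sharded_gb = total_params * bytes_per_param / 1e9                 # ~472 GB if fully TP-sharded
replicated_params = 7e9     # assumed attention/dense params kept whole on each DP rank
extra_gb = replicated_params * bytes_per_param * (dp - 1) / 1e9   # ~98 GB of replication overhead
print(round(sharded_gb + extra_gb))                               # ~570 GB; FP8 roughly halves this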

Contributor

@merrymercy merrymercy left a comment

Great work!

@merrymercy merrymercy enabled auto-merge (squash) November 16, 2024 08:57
@merrymercy merrymercy merged commit 976bc30 into sgl-project:main Nov 16, 2024
13 checks passed
@merrymercy merrymercy mentioned this pull request Nov 24, 2024
@fengyang95

@ispobock Does this support W4A16? My VRAM is very limited, and even with FP8 the VRAM is not enough.

@ispobock ispobock mentioned this pull request Jan 7, 2025
@ispobock
Collaborator Author

ispobock commented Jan 7, 2025

Does this support W4A16?

@fengyang95 AWQ is supported in #2364.
