Support DP MLA #1970
Conversation
@ispobock How much performance improvement is expected? Is it mainly in throughput or latency?
@fengyang95 There is an issue with dp 8. I will test the performance once the issue is fixed. It is mainly for throughput.
@ispobock Hi, when is CUDA graph support planned? It is critical for latency improvement.
I will support it soon. The code is almost done and needs some tests.
@ispobock Approximately how much additional VRAM would this require?
@fengyang95 For the 236B V2 model, with DP attention the total weights take ~570 GB, so an FP8-quantized model is preferred for better performance.
Great work!
@ispobock Does this support W4A16? My VRAM is very limited, and even with FP8 it is not enough.
@fengyang95 AWQ is supported in #2364.
Motivation
Support data parallelism on MLA (multi-head latent attention) for the DeepSeek model to reduce the replicated KV cache.
Modifications
- Add the `--enable-dp-attention` option. When it is turned on, DP and TP share the same workers.
- Add an `IDLE` forward mode for workers that have no sequences to forward but still need to join TP synchronization with other workers (a minimal sketch of this idea follows the list).
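The sketch below is a hypothetical illustration of the `IDLE` forward mode idea, not the PR's actual code: with DP attention, each DP rank builds its own batch, and a rank with no sequences must still run the forward pass so that the tensor-parallel collectives issued by the other ranks do not stall. The `ForwardMode`, `Batch`, and `build_local_batch` names are illustrative assumptions.

```python
# Minimal sketch (not the PR's implementation) of why an IDLE forward mode
# is needed when DP and TP share the same workers.
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class ForwardMode(Enum):
    PREFILL = auto()
    DECODE = auto()
    IDLE = auto()  # no local sequences; run forward only to join TP collectives


@dataclass
class Batch:
    seq_ids: List[int]
    mode: ForwardMode


def build_local_batch(local_seq_ids: List[int], decoding: bool) -> Batch:
    """Each DP rank builds its own batch; a rank with no work still returns one."""
    if not local_seq_ids:
        # The worker has nothing to compute, but it must still enter the model
        # forward pass so that it participates in every TP all-reduce/all-gather
        # issued by the other workers; otherwise those collectives would hang.
        return Batch(seq_ids=[], mode=ForwardMode.IDLE)
    mode = ForwardMode.DECODE if decoding else ForwardMode.PREFILL
    return Batch(seq_ids=local_seq_ids, mode=mode)


if __name__ == "__main__":
    # Rank 0 has sequences, rank 1 does not; both still produce a batch to forward.
    print(build_local_batch([3, 7], decoding=True).mode)  # ForwardMode.DECODE
    print(build_local_batch([], decoding=True).mode)      # ForwardMode.IDLE
```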
Performance
Compared to the main branch, this PR improves prefill throughput by 20% and decode throughput by 67% for the DeepSeek-V2 model on 8×H100.
DP+TP (this PR):
TP (main branch):
Reproduce:
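As a rough, hypothetical illustration of the reproduction flow (not the PR author's actual commands), the snippet below assumes a server launched with DP attention enabled and queries sglang's native `/generate` endpoint on the default port; the model path, port, and parallel sizes are placeholders to adjust for the real setup.

```python
# Hypothetical smoke test against a locally running sglang server.
# Assumes the server was started with DP attention enabled, for example:
#   python -m sglang.launch_server --model-path <deepseek-model> \
#       --tp 8 --enable-dp-attention --trust-remote-code
# The port and endpoint below are sglang defaults and may need adjusting.
import requests

response = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 16},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["text"])
```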
TODO