[WIP] Deepseek V2 MLA #10927
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
…gging fi mla kernel issue
        return self.forward_decode(positions, hidden_states, kv_cache,
                                   attn_metadata)

    def forward_prefill(
Does FlashInfer have a prefill kernel?
Nice job! I wonder how you solve the MLA prefill kernel, because the FlashInfer library has no MLA prefill kernel available, only a decode kernel.
@liangzelang this PR will perform the regular up projection to turn MLA into MHA for prefill.
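To make the reply above concrete, here is a minimal sketch of what "up projection to turn MLA into MHA" can look like. It is not the code from this PR: the function and argument names (`up_project_for_prefill`, `c_kv`, `k_pe`, `w_uk`, `w_uv`) and the toy sizes are hypothetical, and it assumes the usual MLA layout in which each token stores one compressed latent plus one positional key part shared by all heads. Pure-Python lists are used so it runs without dependencies.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def up_project_for_prefill(c_kv, k_pe, w_uk, w_uv, num_heads):
    """Expand the compressed KV latent into per-head K and V (MLA -> MHA).

    All names here are hypothetical, chosen for illustration only.
    c_kv: [tokens][kv_rank]               compressed latent per token
    k_pe: [tokens][rope_dim]              positional key part shared by heads
    w_uk: [kv_rank][num_heads * nope_dim] key up-projection weights
    w_uv: [kv_rank][num_heads * v_dim]    value up-projection weights
    Returns k[tokens][heads][nope_dim + rope_dim] and v[tokens][heads][v_dim].
    """
    k_flat = matmul(c_kv, w_uk)
    v_flat = matmul(c_kv, w_uv)
    nope_dim = len(k_flat[0]) // num_heads
    v_dim = len(v_flat[0]) // num_heads
    k, v = [], []
    for t in range(len(c_kv)):
        # Each head gets its own slice of the up-projected key,
        # followed by the positional part that all heads share.
        k.append([k_flat[t][h * nope_dim:(h + 1) * nope_dim] + k_pe[t]
                  for h in range(num_heads)])
        v.append([v_flat[t][h * v_dim:(h + 1) * v_dim]
                  for h in range(num_heads)])
    return k, v


# Toy example: 1 token, kv_rank=2, 2 heads, nope_dim=2, rope_dim=1, v_dim=1.
k, v = up_project_for_prefill(
    c_kv=[[1.0, 2.0]],
    k_pe=[[9.0]],
    w_uk=[[1, 0, 0, 1], [0, 1, 1, 0]],
    w_uv=[[1, 1], [0, 2]],
    num_heads=2,
)
```

After this expansion the keys and values have an ordinary multi-head shape, so a standard MHA prefill kernel can consume them; the cost is that the expanded tensors are materialized for the prefill tokens.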
This pull request has merge conflicts that must be resolved before it can be
Signed-off-by: simon-mo <[email protected]>
Update:
Then it will be ready for review
Signed-off-by: simon-mo <[email protected]>
I think the accuracy issues of the FlashInfer kernel might be related to this:
Co-authored-by: cennn <[email protected]> Signed-off-by: simon-mo <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Thanks to @cennn, the accuracy issue has been partially identified. We are now at a point where the kernel generates coherent output. However, the accuracy is still lower than that of the MHA implementation.
Status (12/05/2024):
Currently, I have implemented the KV cache in MLA format and used FlashInfer's MLA decode kernel, with correct output. The throughput for a sample case already goes from 10.47 rps to 18.5 rps. The PR is still very messy and lacks a proper design, but we have demonstrated space savings and a speedup.
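As a rough illustration of where the space savings come from, the sketch below counts KV-cache elements per token per layer for an MHA-style cache versus an MLA-format cache. The helper name and the sizes passed to it are assumptions for illustration, not values taken from this PR; substitute the values from the model config you are actually running.

```python
def kv_cache_elems_per_token(num_heads, qk_nope_dim, qk_rope_dim, v_dim, kv_rank):
    """Return (mha_elems, mla_elems) stored per token per layer.

    Hypothetical helper for illustration only.
    MHA-style cache: a full key and a full value for every head.
    MLA-format cache: one compressed latent plus one shared positional key part.
    """
    mha = num_heads * (qk_nope_dim + qk_rope_dim) + num_heads * v_dim
    mla = kv_rank + qk_rope_dim
    return mha, mla


# Assumed, illustrative sizes (not read from this PR).
mha_elems, mla_elems = kv_cache_elems_per_token(
    num_heads=128, qk_nope_dim=128, qk_rope_dim=64, v_dim=128, kv_rank=512
)
ratio = mha_elems / mla_elems
print(f"MHA: {mha_elems}, MLA: {mla_elems}, ratio: {ratio:.1f}x")
```

A smaller per-token footprint lets more tokens fit in the same cache memory, which is one plausible contributor to the throughput change reported above; the sketch says nothing about kernel speed itself.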
Before
Some todos:
- Figure out CUDA graph issue. Will just opt out of it for now (also turn off chunked prefill)
- --disable-mla and DISABLE_MLA

Some out of scope: