
MLA prefill w/o weight absorption #2349

Merged 4 commits into sgl-project:main on Dec 4, 2024

Conversation

@ispobock (Collaborator) commented Dec 4, 2024

Motivation

For large batch sizes, not using weight absorption in the MLA prefill phase is more efficient (a rough sketch of the two compute paths is included after the benchmark results below).

python3 -m sglang.bench_one_batch --batch-size 128 --input 512 --output 1 --model neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --trust-remote-code --tp 8 --disable-cuda-graph

# prefill w/o absorb (this PR)
Prefill. latency: 2.17135 s, throughput:  30182.18 token/s

# prefill w/ absorb (main)
Prefill. latency: 3.29770 s, throughput:  19873.24 token/s
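
(For reference, these throughput numbers are simply batch_size × input_len / latency: 128 × 512 / 2.17135 s ≈ 30182 token/s for this PR and 128 × 512 / 3.29770 s ≈ 19873 token/s for main.)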

prefill w/o absorb (this PR): [image]

prefill w/ absorb (main): [image]

ref: #2203
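
For readers less familiar with MLA, here is a minimal self-contained sketch of the two compute paths compared above. All shapes and weight names are toy assumptions (DeepSeek-V2 actually uses kv_lora_rank=512 and head_dim=128), not the actual SGLang implementation; it only illustrates why the two paths give the same output at different cost.

# Toy comparison of MLA attention with and without weight absorption.
import torch

B, S, H = 2, 16, 4          # batch, sequence length, attention heads
D_LATENT, D_HEAD = 64, 32   # compressed-KV rank, per-head dim (RoPE part omitted)

q_nope = torch.randn(B, S, H, D_HEAD)                        # query (non-RoPE part)
c_kv   = torch.randn(B, S, D_LATENT)                         # compressed KV latent
w_uk   = torch.randn(H, D_HEAD, D_LATENT) / D_LATENT ** 0.5  # K up-projection
w_uv   = torch.randn(H, D_LATENT, D_HEAD) / D_LATENT ** 0.5  # V up-projection
scale  = D_HEAD ** -0.5

# Path A: with weight absorption (prefill on main). W_UK is folded into the
# query, so attention scores are computed directly against the D_LATENT-dim
# latent; cheap when the query length is small, expensive for long prefills.
q_abs      = torch.einsum("bshd,hdl->bshl", q_nope, w_uk)
scores_abs = torch.einsum("bshl,btl->bhst", q_abs, c_kv) * scale
out_latent = torch.einsum("bhst,btl->bshl", scores_abs.softmax(-1), c_kv)
out_abs    = torch.einsum("bshl,hld->bshd", out_latent, w_uv)

# Path B: without weight absorption (prefill in this PR). The latent is first
# expanded back to per-head K/V, then normal MHA runs with D_HEAD-dim scores;
# cheaper when the query length is large.
k       = torch.einsum("btl,hdl->bthd", c_kv, w_uk)
v       = torch.einsum("btl,hld->bthd", c_kv, w_uv)
scores  = torch.einsum("bshd,bthd->bhst", q_nope, k) * scale
out_mha = torch.einsum("bhst,bthd->bshd", scores.softmax(-1), v)

# Same result either way; only the FLOP/memory trade-off changes with batch size.
assert torch.allclose(out_abs, out_mha, atol=1e-4)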

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs (Member) commented Dec 4, 2024

QQ: why disable-cuda-graph and not enable-dp-attention? And what are the throughput and latency like at small batch sizes?

@ispobock (Collaborator, Author) commented Dec 4, 2024

Why disable-cuda-graph and not enable-dp-attention? And what are the throughput and latency like at small batch sizes?

  • We only test prefill here; CUDA graph is only used for decoding.
  • The prefill performance difference becomes noticeable only at large batch sizes. However, our serving benchmarks use chunked prefill, with default chunk sizes of 8192 for TP and 4096 for DP attention (due to workload issues with FusedMoE), so we cannot test a large prefill batch there. For this test I evaluated only a single batch, which I believe is sufficient to illustrate the performance difference. Once MoE Expert Parallel Impl #2203 is merged, I will present results with larger chunk sizes on DP + EP configurations. (A small-batch variant of the same benchmark is shown below for reference.)
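
For reference, the small-batch behavior asked about above can be measured with the same one-batch benchmark by lowering --batch-size; all other flags mirror the command in the PR description (the value 1 here is only an example, and no numbers are claimed for it):

python3 -m sglang.bench_one_batch --batch-size 1 --input 512 --output 1 --model neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --trust-remote-code --tp 8 --disable-cuda-graph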

@zhyncs (Member) commented Dec 4, 2024

What I mean here is not only performance but also compatibility: all features should remain compatible with each other.

@zhyncs merged commit ec52464 into sgl-project:main on Dec 4, 2024. 15 checks passed.
@liangzelang

Hi @ispobock,
Great job! I have tested DeepSeek-V2 and prefill is indeed faster.
But in the modified code, the prefill stage expands MLA into a normal MHA (Multi-Head Attention) and does not use weight absorption; in other words, the change is that MLA prefill no longer uses matrix absorption. So what exactly does the PR title mean, or did I misunderstand something?
