
MLA prefill w/o weight absorption #2349

Merged 4 commits into sgl-project:main on Dec 4, 2024

Conversation

@ispobock (Collaborator) commented Dec 4, 2024

Motivation

For large batch sizes, not using weight absorption in the MLA prefill phase is more efficient (a rough sketch of the two compute paths is included after the benchmark results below).

python3 -m sglang.bench_one_batch --batch-size 128 --input 512 --output 1 --model neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --trust-remote-code --tp 8 --disable-cuda-graph

# prefill w/o absorb (this PR)
Prefill. latency: 2.17135 s, throughput:  30182.18 token/s

# prefill w/ absorb (main)
Prefill. latency: 3.29770 s, throughput:  19873.24 token/s
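
(For reference, these throughput numbers are simply batch_size × input_len / latency: 128 × 512 / 2.17135 s ≈ 30182 token/s for this PR and 128 × 512 / 3.29770 s ≈ 19873 token/s for main.)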

prefill w/o absorb (this PR): [image]

prefill w/ absorb (main): [image]

ref: #2203
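
For readers less familiar with MLA, here is a minimal self-contained sketch of the two compute paths compared above. All shapes and weight names are toy assumptions (DeepSeek-V2 actually uses kv_lora_rank=512 and head_dim=128), not the actual SGLang implementation; it only illustrates why the two paths give the same output at different cost.

# Toy comparison of MLA attention with and without weight absorption.
import torch

B, S, H = 2, 16, 4          # batch, sequence length, attention heads
D_LATENT, D_HEAD = 64, 32   # compressed-KV rank, per-head dim (RoPE part omitted)

q_nope = torch.randn(B, S, H, D_HEAD)                        # query (non-RoPE part)
c_kv   = torch.randn(B, S, D_LATENT)                         # compressed KV latent
w_uk   = torch.randn(H, D_HEAD, D_LATENT) / D_LATENT ** 0.5  # K up-projection
w_uv   = torch.randn(H, D_LATENT, D_HEAD) / D_LATENT ** 0.5  # V up-projection
scale  = D_HEAD ** -0.5

# Path A: with weight absorption (prefill on main). W_UK is folded into the
# query, so attention scores are computed directly against the D_LATENT-dim
# latent; cheap when the query length is small, expensive for long prefills.
q_abs      = torch.einsum("bshd,hdl->bshl", q_nope, w_uk)
scores_abs = torch.einsum("bshl,btl->bhst", q_abs, c_kv) * scale
out_latent = torch.einsum("bhst,btl->bshl", scores_abs.softmax(-1), c_kv)
out_abs    = torch.einsum("bshl,hld->bshd", out_latent, w_uv)

# Path B: without weight absorption (prefill in this PR). The latent is first
# expanded back to per-head K/V, then normal MHA runs with D_HEAD-dim scores;
# cheaper when the query length is large.
k       = torch.einsum("btl,hdl->bthd", c_kv, w_uk)
v       = torch.einsum("btl,hld->bthd", c_kv, w_uv)
scores  = torch.einsum("bshd,bthd->bhst", q_nope, k) * scale
out_mha = torch.einsum("bhst,bthd->bshd", scores.softmax(-1), v)

# Same result either way; only the FLOP/memory trade-off changes with batch size.
assert torch.allclose(out_abs, out_mha, atol=1e-4)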

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs (Member) commented Dec 4, 2024

QQ: why disable-cuda-graph and not enable-dp-attention? And what are the throughput and latency like at small batch sizes?

@ispobock (Collaborator, Author) commented Dec 4, 2024

Why disable-cuda-graph and not enable-dp-attention? And what are the throughput and latency like at small batch sizes?

  • We only test prefill here; CUDA graph is only used for decoding.
  • The prefill performance difference becomes noticeable only at large batch sizes. However, our serving benchmarks use chunked prefill, with default chunk sizes of 8192 for TP and 4096 for DP attention (due to workload issues with FusedMoE), so we cannot test a large prefill batch there. For this test I evaluated only a single batch, which I believe is sufficient to illustrate the performance difference. Once MoE Expert Parallel Impl #2203 is merged, I will present results with larger chunk sizes on DP + EP configurations. (A small-batch variant of the same benchmark is shown below for reference.)
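
For reference, the small-batch behavior asked about above can be measured with the same one-batch benchmark by lowering --batch-size; all other flags mirror the command in the PR description (the value 1 here is only an example, and no numbers are claimed for it):

python3 -m sglang.bench_one_batch --batch-size 1 --input 512 --output 1 --model neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --trust-remote-code --tp 8 --disable-cuda-graph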

@zhyncs (Member) commented Dec 4, 2024

What I mean here is not only performance but also compatibility: all features should remain compatible with each other.

@zhyncs merged commit ec52464 into sgl-project:main on Dec 4, 2024. 15 checks passed.
@liangzelang

Hi @ispobock,
Great job! I have tested DeepSeek-V2 and prefill is indeed faster.
But in the modified code, the prefill stage expands MLA into a normal MHA (Multi-Head Attention) and does not use weight absorption; in other words, the change is that MLA prefill no longer uses matrix absorption. So what exactly does the PR title mean, or did I misunderstand something?
