
[Performance]: decoding speed on long context #11286

Open
155394551lzk opened this issue Dec 18, 2024 · 43 comments

Labels
performance Performance-related issues

Comments

155394551lzk commented Dec 18, 2024

Proposal to improve performance

In our experiments, we found that vLLM's decoding speed drops dramatically as the prompt gets longer.
With the batch size fixed at 90, the decoding speed is 5364 tokens/s when the prompt length is within 100 tokens and 5500 tokens/s for 100 to 200 tokens, but it drops to 782 tokens/s for 4000 to 8000 tokens and to 273 tokens/s for prompts longer than 8000 tokens.

| prompt length | 0-100 | 100-200 | 200-500 | 500-1000 | 1000-2000 | 2000-4000 | 4000-8000 | 8000+ |
|---|---|---|---|---|---|---|---|---|
| tokens/s | 5364 | 5500 | 4722 | 2815 | 2484 | 1627 | 782 | 273 |

The GPU is a single A800 (80 GB); vLLM settings are block_size=16, max_num_seqs=512, max_model_len=8192, max_tokens=200. Is this because paged attention has to access the KV cache more often?
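For reference, a minimal sketch of the reported setup using vLLM's offline API; the model name and the batch of 90 identical prompts are placeholders for illustration only, since the issue does not state which model was used.

```python
# Minimal sketch of the configuration described above (offline vLLM API).
# The model name is a placeholder; the issue does not say which model was used.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder
    block_size=16,
    max_num_seqs=512,
    max_model_len=8192,
)
sampling = SamplingParams(max_tokens=200)
outputs = llm.generate(["<long prompt here>"] * 90, sampling)  # batch size 90
```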

noooop (Contributor) commented Dec 18, 2024

Most of the time the GPU hits a memory-bandwidth bottleneck rather than a compute bottleneck, so decoding speed depends on memory bandwidth.

As the prompt length increases, the KV cache that must be read at every step can even exceed the size of the model weights. That is why inference speed decreases.

Going further: inference latency increases linearly with context size, primarily due to the time needed to access cached tokens. You can even fit a linear function with x = KV-cache size to read and y = time per decoding step.

See FlashDecoding for more.
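As a rough illustration of that linear relationship, here is a hedged back-of-the-envelope estimate; the weight size and A800 bandwidth below are assumed round numbers, not measurements.

```python
# Hedged estimate: per-step decode latency ~ (weight bytes + KV-cache bytes read) / bandwidth.
# The weight size and bandwidth are illustrative assumptions, not measurements.
def step_latency_ms(weight_bytes: float, kv_bytes: float, bandwidth_bytes_per_s: float) -> float:
    return (weight_bytes + kv_bytes) / bandwidth_bytes_per_s * 1e3

weights = 18e9       # e.g. a 4-bit ~32B model, ~18 GB
bandwidth = 2.0e12   # A800 HBM2e, roughly 2 TB/s
for kv in (1e9, 8e9, 16e9):  # KV cache read per step grows with context length
    print(f"kv={kv/1e9:.0f} GB -> ~{step_latency_ms(weights, kv, bandwidth):.1f} ms/step")
```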

Flynn-Zh commented:

GPU: L40, model: Qwen2.5-32B-GPTQ-Int4.
Same question: with a 9k-word prompt, vLLM takes 12 seconds while SGLang takes 8 seconds.
Is there any configuration that can improve performance?

noooop (Contributor) commented Dec 19, 2024

@Flynn-Zh

By default:

  • vLLM uses flash_attn for decoding
  • SGLang uses FlashInfer for decoding

FlashInfer used to be slightly faster than flash_attn, but I'm not sure that is still the case.

You can try vLLM + FlashInfer and see if that improves performance.

Looking forward to your benchmark

Flynn-Zh commented Dec 19, 2024

@noooop

I can't find the configuration for FlashInfer. How do I use FlashInfer in vLLM?

noooop (Contributor) commented Dec 19, 2024

  1. Install FlashInfer.

     I'm not sure vLLM supports the newly released FlashInfer v0.2.0 (#11314); it is safer to use FlashInfer v0.1.6.

  2. Set the environment variable VLLM_ATTENTION_BACKEND=FLASHINFER to enable FlashInfer.
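For example, a minimal sketch of step 2 in Python, using the model discussed in this thread; setting the variable in the shell before launching the server works just as well.

```python
# Select the FlashInfer attention backend; the variable must be set before vLLM starts.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4")  # model discussed in this thread
```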

Flynn-Zh commented:

With FlashInfer it still takes 12 seconds.

noooop (Contributor) commented Dec 19, 2024

> With FlashInfer it still takes 12 seconds.

interesting

jeejeelee (Collaborator) commented:

Maybe it is a similar issue to #11317 (comment).

Flynn-Zh commented:

@jeejeelee I tried increasing max-seq-len-to-capture, but it didn't help.

noooop (Contributor) commented Dec 19, 2024

@Flynn-Zh

vLLM v0 uses the default scheduler; chunked prefill performs better for long inputs.

Please try the configuration below (a sketch follows):

enable_chunked_prefill = True
max_num_seqs = 32
max_num_batched_tokens = 2048  <- 2048 tokens is generally enough to saturate the GPU
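A minimal sketch of that configuration with the offline LLM API, using the model discussed in this thread; the equivalent flags should also be accepted by the server CLI.

```python
# Sketch of the suggested chunked-prefill configuration (offline API).
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4",  # model from this thread
    enable_chunked_prefill=True,
    max_num_seqs=32,
    max_num_batched_tokens=2048,  # ~2048 tokens per batch usually saturates the GPU
)
```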

Flynn-Zh commented:

@noooop I've also tried that; it didn't help either.

jeejeelee (Collaborator) commented Dec 19, 2024

> @jeejeelee I tried increasing max-seq-len-to-capture, but it didn't help.

Could you please provide more details, such as the model, running script, etc.? I can try to reproduce your issue if I have bandwidth this weekend.

Flynn-Zh commented:

@jeejeelee

> GPU: L40, model: Qwen2.5-32B-GPTQ-Int4. Same question: with a 9k-word prompt, vLLM takes 12 seconds while SGLang takes 8 seconds. Is there any configuration that can improve performance?

I only use VS Code with the REST Client plugin to test v1/chat/completions. The prompt is the long content of a document for the LLM to summarize, with a maximum output length of 500.

Because there were some issues with SGLang 0.4.0, I just tried SGLang 0.3.2 again; it takes 6 s.

JaheimLee commented Dec 20, 2024

> @jeejeelee
>
> GPU: L40, model: Qwen2.5-32B-GPTQ-Int4. Same question: with a 9k-word prompt, vLLM takes 12 seconds while SGLang takes 8 seconds. Is there any configuration that can improve performance?
>
> I only use VS Code with the REST Client plugin to test v1/chat/completions. The prompt is the long content of a document for the LLM to summarize, with a maximum output length of 500.
>
> Because there were some issues with SGLang 0.4.0, I just tried SGLang 0.3.2 again; it takes 6 s.

Have you tried the "gptq" kernel? In my case the "gptq" kernel is faster than the "marlin" kernel; I'm not sure whether that is a bug. My GPU is a 3090.

noooop (Contributor) commented Dec 20, 2024

https://github.com/noooop/vllm/blob/f13a07b1f8c11ddbdc53b40f1fbb24bf3166b900/vllm/model_executor/layers/quantization/gptq.py#L242C1-L245C62

  1. The "gptq" kernel uses a plain GEMM; this may help at large batch sizes.

  2. The Neural Magic blog shows Marlin beating GPTQ, AWQ, and FP16 at all batch sizes.

     [image: Neural Magic Marlin benchmark]

  3. Kernel performance may also depend on the device.

  4. I'm actually looking at how to use dequantize + float16 matmul in gptq, like awq does:
     https://github.com/noooop/vllm/blob/f13a07b1f8c11ddbdc53b40f1fbb24bf3166b900/vllm/model_executor/layers/quantization/awq.py#L164C1-L166C48

  5. I tested this myself: FP16_MATMUL_HEURISTIC_CONDITION = x.shape[:-1].numel() >= 256 is useful (see the sketch below).
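For context, a hedged illustration of that heuristic; the dequantized-weight and quantized-GEMM callables here are placeholders, not actual vLLM APIs.

```python
# Illustration of the FP16_MATMUL_HEURISTIC_CONDITION idea from awq.py:
# small token batches go through the quantized GEMM kernel, large batches
# dequantize to fp16 and use a regular matmul. The callables are placeholders.
import torch

def quantized_linear(x: torch.Tensor, dequant_weight_fp16, quant_gemm):
    # x has shape (..., hidden); x.shape[:-1].numel() is the number of tokens.
    FP16_MATMUL_HEURISTIC_CONDITION = x.shape[:-1].numel() >= 256
    if FP16_MATMUL_HEURISTIC_CONDITION:
        return torch.matmul(x, dequant_weight_fp16())  # fp16 path for big batches
    return quant_gemm(x)                               # quantized kernel for small batches
```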

Flynn-Zh commented:

@jeejeelee @noooop I just tried gptq again, and it's basically the same as gptq_marlin.

noooop (Contributor) commented Dec 25, 2024

code

setting

Offline inference

prefills

  • input_len = 8000
  • output_len = 16
  • num_prompts = 11
Time in seconds:

| max_num_batched_tokens | vllm 0.6.4 + gptq_marlin | vllm 0.6.4 + gptq | sglang 0.4.0.post2 |
|---|---|---|---|
| 1024 | 4.15 | 3.67 | 4.16 |
| 512 | 4.19 | 4.58 | 4.27 |
| 256 | 4.34 | 6.58 | 4.26 |
| 128 | 4.53 | 11.52 | 4.50 |
| 64 | 5.53 | 21.95 | 5.21 |
| 32 | 8.59 | 18.12 | 7.72 |

decoding

  • input_len = 8000
  • output_len = 512
  • num_prompts = 11

Time in seconds:

| configuration | decoding |
|---|---|
| vllm 0.6.4 + flash attention | 16.74334423 |
| vllm 0.6.4 + flashinfer | 16.76823786 |
| sglang 0.4.0.post2 | 16.09388748 |

conclusion

  1. For offline inference with chunked prefill, vLLM and SGLang are almost the same speed.
  2. marlin (MarlinLinearKernel) works well at almost all max_num_batched_tokens values.
  3. gptq (ExllamaLinearKernel) probably works well above 1024, but it is not much better.
  4. There is almost no difference in speed between FlashInfer and flash attention.
  5. There is almost no difference in speed between vLLM 0.6.4 and vLLM 0.6.5.

Situations not tested

  1. The vLLM default scheduler (without chunked prefill) OOMs on my 4090.
  2. MacheteLinearKernel requires compute capability 90; the current GPU (4090) has compute capability 89.
  3. Maybe the vLLM and SGLang web servers have different speeds.
  4. Maybe vLLM and SGLang produce different output lengths.
  5. Maybe some kind of cache was hit.

vLLM and SGLang use almost the same MLP and attention implementations, and this code has been optimized for years. At least for offline testing, the speed can't differ that much.

I'm not very familiar with the web server side and need other experts to help.

Flynn-Zh commented Dec 26, 2024

Test results (stably reproducible; each measurement is the first call):

[image: vllm]
[image: sglang]

Server launch commands:

[image: vllm-cmd]
[image: sglang-cmd]

Hardware: a single L40; vLLM 0.6.5 and SGLang 0.4.0.post2 use the same L40.

noooop (Contributor) commented Dec 26, 2024

> Test results (stably reproducible; each measurement is the first call):

vLLM outputs 283 tokens in 117804 ms; SGLang outputs 328 tokens in 6226 ms.

How could that happen?

@Flynn-Zh
Can you run an offline test?

https://github.com/noooop/snippet/blob/main/benchmarks/test_gptq/main.py

Flynn-Zh commented:

@noooop There are some errors when executing main.py (result.txt attached).

Flynn-Zh commented:

@noooop I modified main.py and ran the offline test again; the result is in result.txt.

noooop (Contributor) commented Dec 26, 2024

> @noooop There are some errors when executing main.py (result.txt attached).

I'm very sorry; I passed the unsupported parameter enforce_eager to sgl.Engine but didn't test it.

Summary

@Flynn-Zh

> I modified main.py and ran the offline test again; the result is in result.txt.

Hardware: a single L40

Offline inference

prefills

  • input_len = 8000
  • output_len = 16
  • num_prompts = 11

using chunked prefill (time in seconds):

| batchsize | vllm + gptq_marlin | vllm + gptq | sglang 0.4.0.post2 |
|---|---|---|---|
| 1024 | 2.41 | 3.01 | 2.33 |
| 512 | 2.47 | 3.43 | 2.35 |
| 256 | 2.57 | 4.14 | 2.49 |
| 128 | 2.80 | 6.51 | 2.79 |
| 64 | 3.83 | 11.97 | 3.82 |
| 32 | 7.10 | 13.10 | 7.21 |

vllm default scheduler (time in seconds):

| method | prefill |
|---|---|
| gptq_marlin | 2.33 |
| gptq | 2.35 |
| gptq_marlin + enforce_eager | 2.49 |
| gptq + enforce_eager | 2.79 |

decoding

  • input_len = 8000
  • output_len = 512
  • num_prompts = 11

Time in seconds:

| configuration | decoding |
|---|---|
| vllm chunked prefill = 1024, flash attention | 15.50 |
| vllm chunked prefill = 1024, flashinfer | 15.86 |
| vllm default scheduler + gptq_marlin | 16.24 |
| vllm default scheduler + gptq | 15.88 |
| vllm default scheduler + gptq_marlin + enforce_eager | 16.07 |
| vllm default scheduler + gptq + enforce_eager | 16.07 |
| sglang 0.4.0.post2 | 10.41 |

conclusion

  1. For prefill, SGLang is similar to vLLM.

  2. For decoding, SGLang takes 10.41 s vs 15~16 s for vLLM under all configurations. Really faster.

  3. For vLLM:

     L40: 864 GB/s
     4090: 1008 GB/s

     So the 4090 prefill is slower than the L40 while decoding is almost the same; very reasonable.

  4. I don't know why, but in the decoding stage SGLang is indeed faster than vLLM.

noooop (Contributor) commented Dec 26, 2024

@jeejeelee
Come and take a look:

> 4. I don't know why, but in the decoding stage SGLang is indeed faster than vLLM.

noooop (Contributor) commented Dec 26, 2024

@Flynn-Zh

Referring to "LLM inference speed of light":

Qwen2.5-32B-GPTQ-Int4 weighs about 18 GB.

18 GB / 864 GB/s ≈ 20 ms per decoding step

prefill 2.33 s + decoding 20 ms × (512 - 16) ≈ 12.3 s

This does not take the KV cache into account.

  • vLLM at 15~16 s is very reasonable.
  • SGLang at 10.41 s can't happen.
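A small script to reproduce that back-of-the-envelope estimate (same inputs as stated above; KV-cache traffic is deliberately ignored):

```python
# Speed-of-light estimate for L40 decoding, using the numbers quoted above.
weights_gb = 18            # Qwen2.5-32B-GPTQ-Int4 weight size
bandwidth_gb_s = 864       # L40 memory bandwidth
step_s = weights_gb / bandwidth_gb_s          # ~0.021 s to stream the weights once
prefill_s = 2.33
decode_steps = 512 - 16
total_s = prefill_s + step_s * decode_steps   # KV-cache reads not included
print(f"{step_s*1e3:.1f} ms/step, total ~{total_s:.1f} s")  # ~20.8 ms/step, ~12.7 s
```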

Flynn-Zh commented:

> Referring to "LLM inference speed of light":
>
> Qwen2.5-32B-GPTQ-Int4 weighs about 18 GB.
>
> 18 GB / 864 GB/s ≈ 20 ms per decoding step
>
> prefill 2.33 s + decoding 20 ms × (512 - 16) ≈ 12.3 s
>
> This does not take the KV cache into account.
>
>   • vLLM at 15~16 s is very reasonable.
>   • SGLang at 10.41 s can't happen.

What black technology does SGLang have?

noooop (Contributor) commented Dec 26, 2024

> What black technology does SGLang have?

I've been thinking the same thing. What black technology does SGLang have?

  • Speculative sampling?
  • A sliding window?
  • The prefix cache can only help in the prefill stage, so that's out.

jeejeelee (Collaborator) commented:

> @jeejeelee Come and take a look:
>
> 4. I don't know why, but in the decoding stage SGLang is indeed faster than vLLM.

We can run a profiler to investigate it.

Flynn-Zh commented Dec 27, 2024

> Referring to "LLM inference speed of light"

@noooop My understanding is that the calculation in that article is suited to MHA models, but Qwen2.5 is a GQA model. Is my understanding correct?

noooop (Contributor) commented Dec 27, 2024

GQA only affects the KV-cache part of the latency estimate, and we are not considering the KV cache here at all.
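For intuition, a hedged sketch of why GQA only changes the KV-cache term: per-token KV bytes scale with the number of KV heads, not query heads. The layer/head numbers below are illustrative assumptions for a Qwen2.5-32B-like configuration, not values taken from the checkpoint.

```python
# Per-sequence KV-cache size under GQA; all config numbers are assumptions.
def kv_cache_bytes(seq_len: int, layers: int = 64, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len  # K and V

print(kv_cache_bytes(8000) / 1e9, "GB")  # ~2.1 GB for one 8k-token sequence
```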

noooop (Contributor) commented Dec 27, 2024

Let's try an NVIDIA Nsight profile.

4090 profile.zip

input_len = 8000
output_len = 16
num_prompts = 1
chunked prefill size = 1024

for vllm

overall

[image: vllm-1]

prefill × 8 and decoding × 16; straightforward.

Enlarging the decoding part:

[image: vllm-2]

One decoding step takes 25 ms, very reasonable.

Summary of vLLM kernels:

  • prefill Linear: void marlin::Marlin..... ~445.982 μs [1]
  • prefill attention: void flash_fwd_splitkv_kernel..... 114.336 μs ~ 512.510 μs [3]
  • decoding Linear: void marlin::Marlin..... ~164.544 μs [2]
  • decoding attention: void flash_fwd_splitkv_kernel..... ~43.168 μs [4]

for sglang

overall

[image: sgl-1]

One can only roughly see prefill × 8 and decoding × 16.

Enlarging the decoding part:

[image: sgl-2]

One decoding step takes 24 ms, very reasonable.

Summary of SGLang kernels:

Could not find CUDA kernel information, so there is no way to compare it with vLLM.

How to use the NVIDIA Nsight profiler

  1. Install

     https://developer.nvidia.com/nsight-systems/get-started

     Download for Linux on x86_64, Nsight Systems 2024.7.1 Full Version, .run installer.

     apt install nsight-systems didn't work for me.

  2. Profile

nsys profile -w true -o vllm -f true -x true python test_vllm.py
nsys profile -w true -o sgl -f true -x true python test_sgl.py

code

@Flynn-Zh

Looking forward to your profile

noooop (Contributor) commented Dec 27, 2024

@jeejeelee

I'm not familiar with SGLang. Is there a better way to profile SGLang?

Or do I need to add parameters to nsys profile?

Bryce1010 (Contributor) commented:

Keeping an eye on this; I'm curious why it's happening. Does it only occur with Qwen2.5-32B-Instruct-GPTQ-Int4, or does it affect other models too?

noooop (Contributor) commented Dec 27, 2024

> Keeping an eye on this; I'm curious why it's happening. Does it only occur with Qwen2.5-32B-Instruct-GPTQ-Int4, or does it affect other models too?

Yes, it's very strange.

I think the 4090 results are clearly reasonable, while the L40 results are very unreasonable.

I'm trying to determine how it was triggered.

SGLang's 10.41 s feels as if it is not reading any KV cache at all in the decoding stage.

@Flynn-Zh
Please help me measure how long one L40 SGLang decoding step takes.

jeejeelee (Collaborator) commented:

> I'm not familiar with SGLang. Is there a better way to profile SGLang?
>
> Or do I need to add parameters to nsys profile?

I usually use torch.profiler.
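For reference, a minimal torch.profiler sketch; the model name, prompt, and sampling parameters are placeholders for illustration.

```python
# Hedged sketch: profile one generation call with torch.profiler and print the
# top CUDA kernels. The model name and prompt are placeholders.
from torch.profiler import profile, ProfilerActivity
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4")  # model from this thread
params = SamplingParams(max_tokens=64)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    llm.generate(["<long prompt here>"], params)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```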

Flynn-Zh commented:

[image]

My driver is 535.154.

Flynn-Zh commented:

> SGLang's 10.41 s feels as if it is not reading any KV cache at all in the decoding stage.
>
> @Flynn-Zh Please help me measure how long one L40 SGLang decoding step takes.

@noooop sgl.zip attached.

noooop (Contributor) commented Dec 28, 2024

> sgl.zip

for sglang

overall

[image: L40-sgl-1]

  • last prefill: 218 ms

[image: L40-sgl-2]

  • decoding: 15.6 ms. Awesome!

for vllm

overall

[image: L40-vllm-1]

  • last prefill: 317 ms?

[image: L40-vllm-2]

  • decoding: 30 ms?

vLLM kernels:

  • prefill Linear: exllama??? Why not Marlin? [1]
  • prefill attention: void flash_fwd_splitkv_kernel..... OK [3]
  • decoding Linear: gptq:gmm??? Why not Marlin? [2]
  • decoding attention: void flash_fwd_splitkv_kernel..... OK [4]

conclusion

So I think the L40 is slower because vLLM does not use Marlin. But why?

  • Does the L40 not support Marlin?
  • A wrong configuration?
  • A bug?
  • Possibly a driver-version problem makes vLLM fall back from Marlin to exllama; but then why does SGLang work?

noooop (Contributor) commented Dec 28, 2024

[image]

Looking at the earlier log, MarlinLinearKernel appears to be supported.

What did we miss?

noooop (Contributor) commented Dec 28, 2024

Is it caused by setting the quantization parameter?

  • quantization = "gptq_marlin"

I thought it was the same as quantization = None, but maybe it's not.

https://github.com/noooop/snippet/blob/d3f69b532b18639b791218e74b7cfe9100816726/benchmarks/test_gptq/test_vllm.py#L117C1-L117C38

Forcing MarlinLinearKernel seems really fast!

args.environs = {
    "VLLM_DISABLED_KERNELS":
        "GPTQMarlinLinearMethod,MacheteLinearKernel"
}

| batchsize | vllm + gptq_marlin | vllm + gptq | sglang 0.4.0.post2 |
|---|---|---|---|
| 1024 | 2.41 | 3.01 | 2.33 |

But why does it also use the Marlin kernel when I set quantization = "gptq_marlin"?

@Flynn-Zh

Please try:

  1. setting args.quantization = "gptq_marlin", None, and "gptq"
  2. forcing MarlinLinearKernel

noooop (Contributor) commented Dec 28, 2024

For the 4090 (vllm.zip):

[image: vllm-gptq_marlin]

  • quantization = gptq_marlin uses Marlin

[image: vllm-None]

  • quantization = None uses Marlin

[image: vllm-gptq]

  • quantization = gptq uses exllama <- this one is a little slower

Flynn-Zh commented:

> • quantization = "gptq_marlin"

@noooop This is the situation I tested yesterday:

[image]

But there were no errors today; I just reinstalled a lower version of Nsight Systems.

[image]

vllm.zip

noooop (Contributor) commented Dec 28, 2024

[image: L40-vllm-3]

  • last prefill: 317 ms vs 254 ms, yes.
  • Compared with SGLang's 218 ms, the difference is not that big.

[image: L40-vllm-4]

  • decoding: 32 ms??? WTF
  • Even slower than the Linear gptq:gmm path at 30 ms.
  • This is probably why Marlin is not used by default.

Summary of vLLM kernels:

  • prefill Linear: void marlin::Marlin..... yes [1]
  • prefill attention: void flash_fwd_splitkv_kernel..... yes [3]
  • decoding Linear: void marlin::Marlin..... yes [2]
  • decoding attention: void flash_fwd_splitkv_kernel..... yes [4]

conclusion

  • vLLM's Marlin is slower on the L40
  • especially during the decoding stage

There may be various reasons, but then why is there no problem with SGLang?

I suggest you open a new issue named "Marlin slower on L40".

Let's double check.

Wait a minute: there may also be a problem with SGLang's Marlin implementation.

Specifications

GPU memory bandwidth | 864 GB/s

Referring to "LLM inference speed of light":

Qwen2.5-32B-GPTQ-Int4 weighs about 18 GB.

18 GB / 864 GB/s ≈ 20 ms

  • vLLM Marlin decoding: 32 ms
  • vLLM Linear gptq:gmm: 30 ms
  • SGLang decoding: 15.6 ms

  • It may be that vLLM's Marlin is relatively slow.
  • It may also be that there is a problem with SGLang's Marlin implementation, since 15.6 ms is below the ~20 ms needed just to read the weights.

noooop (Contributor) commented Dec 31, 2024

Let's wait for further testing after #11493 is merged.
