Fix the default chunked prefill size #2268

merrymercy · 2024-11-29T23:41:57Z

This is a follow-up to the chunked prefill size adjustment in #2225.

Only adjust the chunk size when the chunked prefill size is not specified in the arguments, so we do not silently change users' arguments.
Use a smaller CUDA graph max bs on small-memory GPUs, because CUDA graph typically does not bring a speedup on these GPUs. A smaller CUDA graph max bs saves memory and prevents OOM.
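
The two rules above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the `Args` class, the `resolve_defaults` helper, the memory threshold, and every numeric default are hypothetical placeholders chosen only to show the pattern of "fill in a default only when the user left the value unset".

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold and defaults, for illustration only.
SMALL_GPU_MEM_MB = 25_000
SMALL_GPU_CHUNK = 2048
LARGE_GPU_CHUNK = 8192
SMALL_GPU_GRAPH_BS = 8
LARGE_GPU_GRAPH_BS = 160


@dataclass
class Args:
    # None means "not specified by the user".
    chunked_prefill_size: Optional[int] = None
    cuda_graph_max_bs: Optional[int] = None


def resolve_defaults(args: Args, gpu_mem_mb: int) -> Args:
    """Fill in memory-dependent defaults without overriding user values."""
    small_gpu = gpu_mem_mb < SMALL_GPU_MEM_MB

    # Only adjust the chunk size when the user did not pass one.
    if args.chunked_prefill_size is None:
        args.chunked_prefill_size = SMALL_GPU_CHUNK if small_gpu else LARGE_GPU_CHUNK

    # Use a smaller CUDA graph max batch size on small-memory GPUs.
    if args.cuda_graph_max_bs is None:
        args.cuda_graph_max_bs = SMALL_GPU_GRAPH_BS if small_gpu else LARGE_GPU_GRAPH_BS

    return args
```

Using `None` as the "unset" marker is what makes it possible to tell an explicit user choice apart from a default, so a value such as `Args(chunked_prefill_size=4096)` passes through unchanged regardless of GPU memory.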

merrymercy · 2024-11-29T23:48:28Z

cc @BBuf

BBuf · 2024-11-30T06:01:33Z

Thanks. Actually, in my Nsight Systems profiling of Qwen2.5 inference on HuggingFace, I observed that using CUDA graph made no difference because the kernel launch time was already at the nanosecond level. I'd like to know the reason for this. In contrast, on A800, without CUDA graph, the kernel launch time during the decoding phase is even longer than the inference time itself. With CUDA graph enabled, the time for a complete decoding iteration can be reduced by half.

merrymercy · 2024-12-01T05:14:54Z

@BBuf I do not fully understand that either. Based on my e2e tests, CUDA graph is very useful on A100/H100. On low-end GPUs (3090, A10G, L40), it has almost no effect when tp=1.

I do not have time to dig into that, but you can play with it more.

Fix chunked prefill size

4573194

merrymercy requested review from Ying1123, hnyls2002, zhyncs, ispobock and ByronHsu as code owners November 29, 2024 23:41

merrymercy changed the title ~~Fix chunked prefill size~~ Fix the default chunked prefill size Nov 29, 2024

merrymercy merged commit 94e167e into main Nov 30, 2024
9 of 15 checks passed

merrymercy deleted the pr-fix-cuda-graph branch November 30, 2024 00:03

BBuf mentioned this pull request Dec 5, 2024

optimize cuda graph max_bs_settings on low-end gpus #2360

Merged
