
optimize cuda graph max_bs_settings on low-end gpus #2360

Conversation

@BBuf (Collaborator) commented Dec 5, 2024

cuda_graph_max_bs best settings on low-end GPUs

Taking the RTX 4090 (24 GB GDDR6X) as an example.

Using sharegpt data, based on sglang v0.3.6
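The numbers below can be reproduced with sglang's serving benchmark against a running server (the exact flags here are an assumption for illustration; check `python -m sglang.bench_serving --help` for your version):

```shell
# Benchmark a running sglang server with ShareGPT-style requests.
# --request-rate sets the target qps (e.g. 11 for the qwen2-7b rows below).
python -m sglang.bench_serving --backend sglang \
    --dataset-name sharegpt \
    --request-rate 11
```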

| Model | Parallel Config | cuda graph enabled | qps | throughput | ttft |
|---|---|---|---|---|---|
| qwen2-7b | tp1 | yes | 11 | 5029 | 0.776 |
| qwen2-7b | tp1 | no | 11 | 5006 | 0.421 |
| qwen2-7b | tp1 | yes | 12 | 5059 | 1.105 |
| qwen2-7b | tp1 | no | 12 | 5094 | 0.626 |
| llama3-8b | tp2 | yes | 3.5 | 7174 | 0.748 |
| llama3-8b | tp2 | no | 3.5 | 7172 | 0.805 |
| qwen2-57b | tp4dp2 | yes | 14 | 5785 | 0.181 |
| qwen2-57b | tp4dp2 | no | 14 | 5477 | 0.193 |
| qwen2-72b | tp4pp2 | yes | 1.9 | 3927 | 0.891 |
| qwen2-72b | tp4pp2 | no | 1.9 | 3769 | 1.208 |

Based on the statistics above, on the RTX 4090: when using TP1/TP2, we can either disable cuda graph or set cuda_graph_max_bs to a very small value to save the memory overhead of creating cuda graphs. When using TP4 or TP8, we need to keep cuda graph enabled to maintain high performance; in that case we can set cuda_graph_max_bs to half of the default value of 160, i.e. 80, to reduce the memory overhead of creating cuda graphs. According to the serving logs of qwen2-72b on TP4, 80 is sufficient and reduces the GPU memory used for cuda graph capture compared to the original 160.
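The recommendations above translate into server launch flags roughly as follows (a sketch based on sglang v0.3.6's `--disable-cuda-graph` and `--cuda-graph-max-bs` options; model paths are examples):

```shell
# TP1/TP2 on a 24 GB card: cuda graph brings little speedup, so disable it
# to reclaim the memory spent on graph capture.
python -m sglang.launch_server --model-path Qwen/Qwen2-7B-Instruct \
    --tp 1 --disable-cuda-graph

# TP4: cuda graph matters for performance; keep it enabled but cap the
# captured batch size at 80 instead of the default 160.
python -m sglang.launch_server --model-path Qwen/Qwen2-72B-Instruct \
    --tp 4 --cuda-graph-max-bs 80
```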

nsys analysis

Llama3-8b tp2

  • no cuda graph

(nsys timeline screenshot)

  • with cuda graph

(nsys timeline screenshot)

We can see that for TP2 llama3-8b serving, the kernel launch time stays at the nanosecond level whether or not cuda graph is enabled, indicating that cuda graph has no substantial effect in this setup.

Qwen2-72b tp4pp2

  • no cuda graph

(nsys timeline screenshot)

  • with cuda graph

(nsys timeline screenshot)

We can see that for TP4 qwen2-72b serving, with cuda graph enabled, the kernel launch time is generally at the nanosecond level. However, without cuda graph enabled, the kernel launch time increases to tens of microseconds, showing a significant difference.

@BBuf (Collaborator, Author) commented Dec 5, 2024

@merrymercy follow up #2268

@merrymercy merrymercy enabled auto-merge (squash) December 6, 2024 09:12
@merrymercy merrymercy disabled auto-merge December 6, 2024 09:12
@merrymercy merrymercy merged commit 34b364e into sgl-project:main Dec 6, 2024
1 of 14 checks passed