
optimize cuda graph max_bs_settings on low-end gpus #2360

Conversation

@BBuf (Collaborator) commented Dec 5, 2024

cuda_graph_max_bs best settings on low-end GPUs

Taking the RTX 4090 (24 GB GDDR6X) as an example.

Using sharegpt data, based on sglang v0.3.6
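The numbers below can be reproduced with sglang's serving benchmark against a running server (the exact flags here are an assumption for illustration; check `python -m sglang.bench_serving --help` for your version):

```shell
# Benchmark a running sglang server with ShareGPT-style requests.
# --request-rate sets the target qps (e.g. 11 for the qwen2-7b rows below).
python -m sglang.bench_serving --backend sglang \
    --dataset-name sharegpt \
    --request-rate 11
```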

| Model | Parallel Config | cuda graph enabled | qps | throughput | ttft |
|---|---|---|---|---|---|
| qwen2-7b | tp1 | yes | 11 | 5029 | 0.776 |
| qwen2-7b | tp1 | no | 11 | 5006 | 0.421 |
| qwen2-7b | tp1 | yes | 12 | 5059 | 1.105 |
| qwen2-7b | tp1 | no | 12 | 5094 | 0.626 |
| llama3-8b | tp2 | yes | 3.5 | 7174 | 0.748 |
| llama3-8b | tp2 | no | 3.5 | 7172 | 0.805 |
| qwen2-57b | tp4dp2 | yes | 14 | 5785 | 0.181 |
| qwen2-57b | tp4dp2 | no | 14 | 5477 | 0.193 |
| qwen2-72b | tp4pp2 | yes | 1.9 | 3927 | 0.891 |
| qwen2-72b | tp4pp2 | no | 1.9 | 3769 | 1.208 |

Based on the statistics above, on the RTX 4090: when using TP1/TP2, we can either disable cuda graph or set cuda_graph_max_bs to a very small value to save the memory overhead of creating cuda graphs. When using TP4 or TP8, we need to keep cuda graph enabled to maintain high performance; in that case we can set cuda_graph_max_bs to half of the default value of 160, i.e. 80, to reduce the memory overhead of creating cuda graphs. According to the serving logs of qwen2-72b on TP4, 80 is sufficient and reduces the GPU memory used for cuda graph capture compared to the original 160.
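The recommendations above translate into server launch flags roughly as follows (a sketch based on sglang v0.3.6's `--disable-cuda-graph` and `--cuda-graph-max-bs` options; model paths are examples):

```shell
# TP1/TP2 on a 24 GB card: cuda graph brings little speedup, so disable it
# to reclaim the memory spent on graph capture.
python -m sglang.launch_server --model-path Qwen/Qwen2-7B-Instruct \
    --tp 1 --disable-cuda-graph

# TP4: cuda graph matters for performance; keep it enabled but cap the
# captured batch size at 80 instead of the default 160.
python -m sglang.launch_server --model-path Qwen/Qwen2-72B-Instruct \
    --tp 4 --cuda-graph-max-bs 80
```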

nsys analysis

Llama3-8b tp2

  • no cuda graph

(nsys timeline screenshot)

  • with cuda graph

(nsys timeline screenshot)

We can see that for TP2 llama3-8b serving, the kernel launch time stays at the nanosecond level whether or not cuda graph is enabled, indicating that cuda graph has no substantial effect in this setup.

Qwen2-72b tp4pp2

  • no cuda graph

(nsys timeline screenshot)

  • with cuda graph

(nsys timeline screenshot)

We can see that for TP4 qwen2-72b serving, with cuda graph enabled, the kernel launch time is generally at the nanosecond level. However, without cuda graph enabled, the kernel launch time increases to tens of microseconds, showing a significant difference.

@BBuf (Collaborator, Author) commented Dec 5, 2024

@merrymercy follow up #2268

@merrymercy merrymercy enabled auto-merge (squash) December 6, 2024 09:12
@merrymercy merrymercy disabled auto-merge December 6, 2024 09:12
@merrymercy merrymercy merged commit 34b364e into sgl-project:main Dec 6, 2024
1 of 14 checks passed