optimize cuda graph max_bs_settings on low-end gpus #2360
Merged
cuda_graph_max_bs best settings on low-end GPUs
Taking the RTX 4090 (24 GB GDDR6X) as an example
Using ShareGPT data, based on sglang v0.3.6
The statistics above show that on the RTX 4090 with TP1 or TP2, we can either disable cuda graph or set cuda_graph_max_bs to a very small value, which saves the memory that would otherwise be spent creating cuda graphs. With TP4 or TP8, cuda graph must stay enabled to maintain high performance. In that case we can set cuda_graph_max_bs to 80, half of the default value of 160. According to the logs of TP4 serving qwen2-72b, 80 is sufficient and reduces the GPU memory overhead of creating cuda graphs compared with the original 160.
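The rule described above can be sketched as a small helper. This is only an illustration, not the code changed in this PR: the function name `suggest_cuda_graph_max_bs`, the 25 GB cutoff for "low-end", and the value 8 standing in for "a very small value" are all assumptions made for the example.

```python
# Hypothetical helper illustrating the heuristic; not sglang's actual code.

DEFAULT_CUDA_GRAPH_MAX_BS = 160  # default value cited in the description
LOW_MEM_THRESHOLD_GB = 25        # assumption: "low-end" means roughly 24 GB or less
SMALL_TP_MAX_BS = 8              # assumption: stand-in for "a very small value"


def suggest_cuda_graph_max_bs(gpu_mem_gb: float, tp_size: int) -> int:
    """Return a suggested cuda_graph_max_bs for one GPU size and TP degree."""
    if gpu_mem_gb >= LOW_MEM_THRESHOLD_GB:
        # Larger GPUs keep the default; this write-up only covers low-end cards.
        return DEFAULT_CUDA_GRAPH_MAX_BS
    if tp_size < 4:
        # TP1/TP2: cuda graph brings little benefit, so keep its footprint tiny.
        return SMALL_TP_MAX_BS
    # TP4/TP8: cuda graph matters, but half of the default is enough.
    return DEFAULT_CUDA_GRAPH_MAX_BS // 2


if __name__ == "__main__":
    for tp in (1, 2, 4, 8):
        print(f"24 GB GPU, TP{tp}: {suggest_cuda_graph_max_bs(24, tp)}")
```

The returned value would then be used as the cuda_graph_max_bs setting when launching the server. For TP1/TP2, disabling cuda graph entirely is the other option mentioned above.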
nsys analysis
Llama3-8b TP2
For TP2 llama3-8b serving, kernel launch time stays at the nanosecond level whether or not cuda graph is enabled, so cuda graph has no substantial effect here.
Qwen2-72b TP4DP2
For TP4 qwen2-72b serving with cuda graph enabled, kernel launch time is generally at the nanosecond level. Without cuda graph, it rises to tens of microseconds, a significant difference.
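To get a feel for why the per-launch difference matters, the back-of-the-envelope calculation below multiplies a per-launch cost by the number of kernel launches in one step. The launch count and per-launch times are illustrative placeholders, not values taken from the nsys traces, and `total_launch_overhead_ms` is a hypothetical helper name.

```python
# Rough arithmetic only; inputs are illustrative, not measured values.

def total_launch_overhead_ms(num_kernels: int, per_launch_us: float) -> float:
    """Total launch overhead in milliseconds for num_kernels launches."""
    return num_kernels * per_launch_us / 1000.0


if __name__ == "__main__":
    kernels_per_step = 500  # assumption: launches in one decode step
    # Nanosecond-level launches (here 0.5 us each) versus tens of microseconds.
    print(total_launch_overhead_ms(kernels_per_step, 0.5))
    print(total_launch_overhead_ms(kernels_per_step, 20.0))
```

Under these assumed numbers the overhead per step grows from a fraction of a millisecond to about ten milliseconds, which is the kind of gap that makes cuda graph worth its memory cost at TP4.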