Commit

optimize cuda graph max_bs_settings on low-end gpus (#2360)
BBuf authored Dec 6, 2024
1 parent 84d96b3 commit 34b364e
Showing 1 changed file with 5 additions and 1 deletion.
6 changes: 5 additions & 1 deletion python/sglang/srt/server_args.py
@@ -184,8 +184,12 @@ def __post_init__(self):

         # Set cuda graph max batch size
         if self.cuda_graph_max_bs is None:
+            # Based on detailed statistics, when serving TP1/TP2 models on lower-end GPUs with HBM<25G, you can either disable cuda graph or set `cuda_graph_max_bs` to a very small value to reduce the memory overhead of creating cuda graphs, with almost no impact on performance. However, when serving models with TP4 or TP8, we need to enable cuda graph to maintain high performance. In this case, we can set `cuda_graph_max_bs` to 80 (half of the default value 160) to reduce the memory overhead of creating cuda graphs. Looking at the logs from TP4 serving of qwen2-72b, a value of 80 is sufficient and can reduce the memory overhead of creating cuda graphs on lower-end GPUs compared to the original 160, avoiding OOM issues.
             if gpu_mem is not None and gpu_mem < 25_000:
-                self.cuda_graph_max_bs = 8
+                if self.tp_size < 4:
+                    self.cuda_graph_max_bs = 8
+                else:
+                    self.cuda_graph_max_bs = 80
             else:
                 self.cuda_graph_max_bs = 160
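
The default-selection logic after this change can be read as a small decision table. The sketch below restates it as a standalone function for illustration only. The name `pick_cuda_graph_max_bs` and the parameter `gpu_mem_mb` are hypothetical and do not appear in `server_args.py`. It also assumes that `gpu_mem` is expressed in megabytes, which the `25_000` threshold and the "HBM<25G" wording in the comment suggest but the diff does not state.

```python
from typing import Optional


def pick_cuda_graph_max_bs(gpu_mem_mb: Optional[float], tp_size: int) -> int:
    """Hypothetical helper mirroring the branch added in this commit.

    gpu_mem_mb is assumed to be GPU memory in megabytes; None means the
    memory size could not be determined.
    """
    if gpu_mem_mb is not None and gpu_mem_mb < 25_000:
        # Low-memory GPU: keep the tiny limit for TP1/TP2, but allow
        # batch sizes up to 80 once tensor parallelism reaches 4.
        return 8 if tp_size < 4 else 80
    # Larger GPUs, or unknown memory size, keep the default of 160.
    return 160


if __name__ == "__main__":
    for gpu_mem_mb, tp_size in [(24_000, 1), (24_000, 4), (80_000, 8), (None, 2)]:
        print(gpu_mem_mb, tp_size, pick_cuda_graph_max_bs(gpu_mem_mb, tp_size))
```

As in the diff, this only describes the default: the branch runs only when `cuda_graph_max_bs` is `None`, so an explicitly configured value is left untouched.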

