Commit 21e9e63: Print progress bar during cuda graph capture (#2502)
1 parent: 1fc84cf
Showing 2 changed files with 21 additions and 1 deletion.
# Enabling cache for torch.compile

SGLang uses the `max-autotune-no-cudagraphs` mode of torch.compile. The auto-tuning can be slow.
If you want to deploy a model on many different machines, you can ship the torch.compile cache to these machines and skip the compilation steps.

This is based on https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html

1. Generate the cache by setting `TORCHINDUCTOR_CACHE_DIR` and running the model once.
   ```
   TORCHINDUCTOR_CACHE_DIR=/root/inductor_root_cache python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile
   ```
2. Copy the cache folder to other machines and launch the server with the same `TORCHINDUCTOR_CACHE_DIR`.
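The copy step above can be sketched as a small shell script. This is a minimal sketch only: the two local temp directories stand in for the build machine and a target machine, and the dummy artifact is illustrative, not a real inductor cache entry.

```shell
# Step 2 sketched with local directories standing in for two machines;
# the paths and the dummy artifact below are illustrative only.
SRC=/tmp/inductor_cache_src_demo   # plays the role of /root/inductor_root_cache
DST=/tmp/inductor_cache_dst_demo   # plays the cache dir on a target machine
mkdir -p "$SRC" "$DST"

# Pretend the first run on the build machine left a compiled artifact here:
echo "compiled kernel" > "$SRC/kernel_0.bin"

# Ship the cache; on real machines this would be rsync/scp, e.g.
#   rsync -a /root/inductor_root_cache/ target-host:/root/inductor_root_cache/
cp -a "$SRC"/. "$DST"/

# On the target, launch with the same cache dir so auto-tuning is skipped:
# TORCHINDUCTOR_CACHE_DIR="$DST" python3 -m sglang.launch_server \
#     --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile
ls "$DST"
```

The cache directory must be readable by the server process on each target machine, and the contents are only reusable when the PyTorch version and GPU architecture match the machine that generated them.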