Add torchao quant (int4/int8/fp8) to llama models #1341
Merged
Commits on Sep 6, 2024
Add torchao quant to sgl llama model for testing
Summary: We want to hack before we work on a proper solution. The proper solution will be to rewrite the llama model with tensor parallelism: https://pytorch.org/docs/stable/distributed.tensor.parallel.html (using DTensor underneath); we are trying to do that here: pytorch/ao#785

Test Plan: change `ENABLE_TORCHAO` to True/False in `python/sglang/srt/models/llama.py` to test the baseline vs. torchao int4 weight only quant performance

`python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8`

```
max_total_num_tokens=432196
Warmup ...
Prefill. latency: 0.03214 s, throughput: 3983.19 token/s
Decode. latency: 0.01383 s, throughput: 72.31 token/s
Decode. latency: 0.01354 s, throughput: 73.88 token/s
Decode. latency: 0.01338 s, throughput: 74.75 token/s
Decode. latency: 0.01330 s, throughput: 75.17 token/s
Decode. median latency: 0.01346 s, median throughput: 74.31 token/s
Total. latency: 0.086 s, throughput: 1531.66 token/s
Benchmark ...
Prefill. latency: 0.02514 s, throughput: 5092.40 token/s
Decode. latency: 0.01337 s, throughput: 74.80 token/s
Decode. latency: 0.01338 s, throughput: 74.74 token/s
Decode. latency: 0.01339 s, throughput: 74.68 token/s
Decode. latency: 0.01321 s, throughput: 75.68 token/s
Decode. latency: 0.01295 s, throughput: 77.23 token/s
Decode. median latency: 0.01337 s, median throughput: 74.77 token/s
Total. latency: 0.132 s, throughput: 1032.13 token/s

max_total_num_tokens=505188
Warmup ...
Prefill. latency: 0.10929 s, throughput: 1171.18 token/s
Decode. latency: 0.00790 s, throughput: 126.57 token/s
Decode. latency: 0.00738 s, throughput: 135.54 token/s
Decode. latency: 0.00724 s, throughput: 138.16 token/s
Decode. latency: 0.00726 s, throughput: 137.71 token/s
Decode. median latency: 0.00732 s, median throughput: 136.62 token/s
Total. latency: 0.139 s, throughput: 949.17 token/s
Benchmark ...
Prefill. latency: 0.10405 s, throughput: 1230.13 token/s
Decode. latency: 0.00769 s, throughput: 129.96 token/s
Decode. latency: 0.00725 s, throughput: 137.85 token/s
Decode. latency: 0.00724 s, throughput: 138.11 token/s
Decode. latency: 0.00731 s, throughput: 136.72 token/s
Decode. latency: 0.00744 s, throughput: 134.47 token/s
Decode. median latency: 0.00730 s, median throughput: 136.97 token/s
Total. latency: 0.163 s, throughput: 834.99 token/s

Warmup ...
Prefill. latency: 0.05868 s, throughput: 2181.51 token/s
Decode. latency: 0.04475 s, throughput: 22.35 token/s
Decode. latency: 0.04463 s, throughput: 22.41 token/s
Decode. latency: 0.04467 s, throughput: 22.39 token/s
Decode. latency: 0.04478 s, throughput: 22.33 token/s
Decode. median latency: 0.04471 s, median throughput: 22.37 token/s
Total. latency: 0.238 s, throughput: 555.78 token/s
Benchmark ...
Prefill. latency: 0.05274 s, throughput: 2427.22 token/s
Decode. latency: 0.04463 s, throughput: 22.41 token/s
Decode. latency: 0.04456 s, throughput: 22.44 token/s
Decode. latency: 0.04453 s, throughput: 22.45 token/s
Decode. latency: 0.04469 s, throughput: 22.38 token/s
Decode. latency: 0.04457 s, throughput: 22.44 token/s
Decode. median latency: 0.04457 s, median throughput: 22.44 token/s
Total. latency: 0.409 s, throughput: 332.13 token/s
```
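To compare the three runs in the log, one option is to pull out the `Decode. median` lines from each Benchmark phase and express throughput relative to the first run. The sketch below is only an illustration, not part of the PR: the helper names `parse_medians` and `relative_throughput` are hypothetical, and because the log does not label its runs, they are simply numbered in the order they appear.

```python
import re

# Benchmark-phase "Decode. median" lines copied from the commit message log,
# in the order the three runs appear. The log does not say which
# configuration produced which run, so the runs are only numbered here.
LOG = """\
Decode. median latency: 0.01337 s, median throughput: 74.77 token/s
Decode. median latency: 0.00730 s, median throughput: 136.97 token/s
Decode. median latency: 0.04457 s, median throughput: 22.44 token/s
"""

MEDIAN_RE = re.compile(
    r"Decode\. median latency: ([\d.]+) s, median throughput: ([\d.]+) token/s"
)


def parse_medians(log):
    """Return (latency_s, throughput_tok_s) for every decode-median line."""
    return [(float(lat), float(tput)) for lat, tput in MEDIAN_RE.findall(log)]


def relative_throughput(medians):
    """Divide each run's median throughput by the first run's."""
    base = medians[0][1]
    return [tput / base for _, tput in medians]


runs = parse_medians(LOG)
ratios = relative_throughput(runs)
for i, ((lat, tput), ratio) in enumerate(zip(runs, ratios), start=1):
    print(f"run {i}: {lat * 1000:.2f} ms/token, {tput:.2f} token/s, {ratio:.2f}x run 1")
```

On these numbers the second run decodes at roughly 1.83x the throughput of the first, and the third at roughly 0.30x.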
Commit 0b095c8
Commits on Sep 7, 2024
- b3696b5
- 281c19b
Commits on Sep 9, 2024
- 01cc04e
- 6712230
- 8ae193b
- 69d2bb5
- b76dc72