Add torchao quant (int4/int8/fp8) to llama models #1341

Summary:
This is a temporary hack to use until we have a proper solution. The proper solution will be to rewrite the llama model with tensor parallelism: https://pytorch.org/docs/stable/distributed.tensor.parallel.html (using DTensor underneath). We are trying to do that here: pytorch/ao#785

Test Plan:
Change `ENABLE_TORCHAO` to True/False in `python/sglang/srt/models/llama.py` to compare the baseline vs. torchao int4 weight-only quant performance.

python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8

```
max_total_num_tokens=432196
Warmup ...
Prefill. latency: 0.03214 s, throughput: 3983.19 token/s
Decode. latency: 0.01383 s, throughput: 72.31 token/s
Decode. latency: 0.01354 s, throughput: 73.88 token/s
Decode. latency: 0.01338 s, throughput: 74.75 token/s
Decode. latency: 0.01330 s, throughput: 75.17 token/s
Decode. median latency: 0.01346 s, median throughput: 74.31 token/s
Total. latency: 0.086 s, throughput: 1531.66 token/s
Benchmark ...
Prefill. latency: 0.02514 s, throughput: 5092.40 token/s
Decode. latency: 0.01337 s, throughput: 74.80 token/s
Decode. latency: 0.01338 s, throughput: 74.74 token/s
Decode. latency: 0.01339 s, throughput: 74.68 token/s
Decode. latency: 0.01321 s, throughput: 75.68 token/s
Decode. latency: 0.01295 s, throughput: 77.23 token/s
Decode. median latency: 0.01337 s, median throughput: 74.77 token/s
Total. latency: 0.132 s, throughput: 1032.13 token/s

max_total_num_tokens=505188
Warmup ...
Prefill. latency: 0.10929 s, throughput: 1171.18 token/s
Decode. latency: 0.00790 s, throughput: 126.57 token/s
Decode. latency: 0.00738 s, throughput: 135.54 token/s
Decode. latency: 0.00724 s, throughput: 138.16 token/s
Decode. latency: 0.00726 s, throughput: 137.71 token/s
Decode. median latency: 0.00732 s, median throughput: 136.62 token/s
Total. latency: 0.139 s, throughput: 949.17 token/s
Benchmark ...
Prefill. latency: 0.10405 s, throughput: 1230.13 token/s
Decode. latency: 0.00769 s, throughput: 129.96 token/s
Decode. latency: 0.00725 s, throughput: 137.85 token/s
Decode. latency: 0.00724 s, throughput: 138.11 token/s
Decode. latency: 0.00731 s, throughput: 136.72 token/s
Decode. latency: 0.00744 s, throughput: 134.47 token/s
Decode. median latency: 0.00730 s, median throughput: 136.97 token/s
Total. latency: 0.163 s, throughput: 834.99 token/s

Warmup ...
Prefill. latency: 0.05868 s, throughput: 2181.51 token/s
Decode. latency: 0.04475 s, throughput: 22.35 token/s
Decode. latency: 0.04463 s, throughput: 22.41 token/s
Decode. latency: 0.04467 s, throughput: 22.39 token/s
Decode. latency: 0.04478 s, throughput: 22.33 token/s
Decode. median latency: 0.04471 s, median throughput: 22.37 token/s
Total. latency: 0.238 s, throughput: 555.78 token/s
Benchmark ...
Prefill. latency: 0.05274 s, throughput: 2427.22 token/s
Decode. latency: 0.04463 s, throughput: 22.41 token/s
Decode. latency: 0.04456 s, throughput: 22.44 token/s
Decode. latency: 0.04453 s, throughput: 22.45 token/s
Decode. latency: 0.04469 s, throughput: 22.38 token/s
Decode. latency: 0.04457 s, throughput: 22.44 token/s
Decode. median latency: 0.04457 s, median throughput: 22.44 token/s
Total. latency: 0.409 s, throughput: 332.13 token/s
```
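To make the comparison in the test plan easier to read, here is a minimal sketch that pulls the median decode numbers out of `bench_latency` output and computes a ratio. It is not part of this PR. The helper name `median_decode` is hypothetical, and treating the first run as the baseline and the second as torchao int4 weight-only is an assumption, since the log above does not label the runs.

```python
import re

# Matches the summary line printed by the benchmark, e.g.
# "Decode. median latency: 0.01337 s, median throughput: 74.77 token/s"
MEDIAN_RE = re.compile(
    r"Decode\.\s+median latency:\s*([\d.]+) s,\s*median throughput:\s*([\d.]+) token/s"
)


def median_decode(log: str) -> tuple[float, float]:
    """Return (median latency in s, median throughput in token/s) from a log.

    Hypothetical helper for illustration; it only reads the first summary
    line it finds in the given text.
    """
    match = MEDIAN_RE.search(log)
    if match is None:
        raise ValueError("no 'Decode. median latency' line found")
    return float(match.group(1)), float(match.group(2))


# "Benchmark ..." summary lines copied from the log above.
# ASSUMPTION: first run = baseline, second run = torchao int4 weight-only.
first_run = "Decode. median latency: 0.01337 s, median throughput: 74.77 token/s"
second_run = "Decode. median latency: 0.00730 s, median throughput: 136.97 token/s"

_, first_tput = median_decode(first_run)
_, second_tput = median_decode(second_run)

speedup = second_tput / first_tput
print(f"decode throughput ratio (second / first): {speedup:.2f}x")
```

Under that assumption, the second run decodes at roughly 1.8 times the throughput of the first, while its prefill latency is higher (0.10405 s vs. 0.02514 s in the benchmark phase).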


Commits on Sep 6, 2024

Commits on Sep 7, 2024

Commits on Sep 9, 2024