Add torchao quant (int4/int8/fp8) to llama models #1341

Merged: 8 commits merged on Sep 9, 2024

Commits on Sep 6, 2024

  1. Add torchao quant to sgl llama model for testing

    Summary:
    This is a temporary hack so that we can test torchao quantization before working on a proper solution.
    
    The proper solution will be to rewrite the llama model with tensor parallelism (https://pytorch.org/docs/stable/distributed.tensor.parallel.html, which uses DTensor underneath). That work is being attempted in pytorch/ao#785.
    
    Test Plan:
    Change `ENABLE_TORCHAO` to True/False in `python/sglang/srt/models/llama.py` to compare baseline vs. torchao int4 weight-only quantization performance.
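For readers unfamiliar with the term, the following is a minimal pure-Python sketch of what group-wise int4 weight-only quantization means conceptually. It is not torchao's implementation: the function names and the tiny group size are hypothetical, and real kernels pack two 4-bit codes per byte and typically use much larger groups. Only the weights are quantized; activations stay in their original precision.

```python
from typing import List, Tuple


def quantize_int4_groupwise(
    weights: List[float], group_size: int = 4
) -> Tuple[List[int], List[float], List[float]]:
    """Asymmetric 4-bit quantization: each group gets its own scale and minimum."""
    codes: List[int] = []
    scales: List[float] = []
    zeros: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # 4 bits give 16 levels (0..15); fall back to 1.0 for constant groups.
        scale = (hi - lo) / 15 or 1.0
        scales.append(scale)
        zeros.append(lo)
        for w in group:
            q = round((w - lo) / scale)
            codes.append(min(15, max(0, q)))  # clamp to the int4 range
    return codes, scales, zeros


def dequantize_int4_groupwise(
    codes: List[int], scales: List[float], zeros: List[float], group_size: int = 4
) -> List[float]:
    """Reconstruct approximate weights from 4-bit codes and per-group parameters."""
    return [
        code * scales[i // group_size] + zeros[i // group_size]
        for i, code in enumerate(codes)
    ]


if __name__ == "__main__":
    w = [0.0, 0.5, 1.0, 1.5, -2.0, -1.0, 0.0, 1.0]
    codes, scales, zeros = quantize_int4_groupwise(w, group_size=4)
    print(codes)
    print(dequantize_int4_groupwise(codes, scales, zeros, group_size=4))
```

Each reconstructed weight differs from the original by at most half of its group's scale, which is why smaller groups trade extra metadata for lower error.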
    
    python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8
    
    ```
    
    max_total_num_tokens=432196
    Warmup ...
    Prefill. latency: 0.03214 s, throughput:   3983.19 token/s
    Decode.  latency: 0.01383 s, throughput:     72.31 token/s
    Decode.  latency: 0.01354 s, throughput:     73.88 token/s
    Decode.  latency: 0.01338 s, throughput:     74.75 token/s
    Decode.  latency: 0.01330 s, throughput:     75.17 token/s
    Decode.  median latency: 0.01346 s, median throughput:     74.31 token/s
    Total. latency:  0.086 s, throughput:   1531.66 token/s
    Benchmark ...
    Prefill. latency: 0.02514 s, throughput:   5092.40 token/s
    Decode.  latency: 0.01337 s, throughput:     74.80 token/s
    Decode.  latency: 0.01338 s, throughput:     74.74 token/s
    Decode.  latency: 0.01339 s, throughput:     74.68 token/s
    Decode.  latency: 0.01321 s, throughput:     75.68 token/s
    Decode.  latency: 0.01295 s, throughput:     77.23 token/s
    Decode.  median latency: 0.01337 s, median throughput:     74.77 token/s
    Total. latency:  0.132 s, throughput:   1032.13 token/s
    
    max_total_num_tokens=505188
    Warmup ...
    Prefill. latency: 0.10929 s, throughput:   1171.18 token/s
    Decode.  latency: 0.00790 s, throughput:    126.57 token/s
    Decode.  latency: 0.00738 s, throughput:    135.54 token/s
    Decode.  latency: 0.00724 s, throughput:    138.16 token/s
    Decode.  latency: 0.00726 s, throughput:    137.71 token/s
    Decode.  median latency: 0.00732 s, median throughput:    136.62 token/s
    Total. latency:  0.139 s, throughput:    949.17 token/s
    Benchmark ...
    Prefill. latency: 0.10405 s, throughput:   1230.13 token/s
    Decode.  latency: 0.00769 s, throughput:    129.96 token/s
    Decode.  latency: 0.00725 s, throughput:    137.85 token/s
    Decode.  latency: 0.00724 s, throughput:    138.11 token/s
    Decode.  latency: 0.00731 s, throughput:    136.72 token/s
    Decode.  latency: 0.00744 s, throughput:    134.47 token/s
    Decode.  median latency: 0.00730 s, median throughput:    136.97 token/s
    Total. latency:  0.163 s, throughput:    834.99 token/s
    
    Warmup ...
    Prefill. latency: 0.05868 s, throughput:   2181.51 token/s
    Decode.  latency: 0.04475 s, throughput:     22.35 token/s
    Decode.  latency: 0.04463 s, throughput:     22.41 token/s
    Decode.  latency: 0.04467 s, throughput:     22.39 token/s
    Decode.  latency: 0.04478 s, throughput:     22.33 token/s
    Decode.  median latency: 0.04471 s, median throughput:     22.37 token/s
    Total. latency:  0.238 s, throughput:    555.78 token/s
    Benchmark ...
    Prefill. latency: 0.05274 s, throughput:   2427.22 token/s
    Decode.  latency: 0.04463 s, throughput:     22.41 token/s
    Decode.  latency: 0.04456 s, throughput:     22.44 token/s
    Decode.  latency: 0.04453 s, throughput:     22.45 token/s
    Decode.  latency: 0.04469 s, throughput:     22.38 token/s
    Decode.  latency: 0.04457 s, throughput:     22.44 token/s
    Decode.  median latency: 0.04457 s, median throughput:     22.44 token/s
    Total. latency:  0.409 s, throughput:    332.13 token/s
    ```
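The three runs above are easier to compare as ratios. The sketch below parses the `Decode.  median` lines from the benchmark phase of each run and expresses median decode throughput relative to the first run. The log does not label which configuration produced which run, so the code refers to runs by order only; the helper names are hypothetical and not part of `sglang.bench_latency`.

```python
import re
from typing import List, Tuple

# Median lines copied from the three runs in the log above.
LOG = """
Warmup ...
Decode.  median latency: 0.01346 s, median throughput:     74.31 token/s
Benchmark ...
Decode.  median latency: 0.01337 s, median throughput:     74.77 token/s
Warmup ...
Decode.  median latency: 0.00732 s, median throughput:    136.62 token/s
Benchmark ...
Decode.  median latency: 0.00730 s, median throughput:    136.97 token/s
Warmup ...
Decode.  median latency: 0.04471 s, median throughput:     22.37 token/s
Benchmark ...
Decode.  median latency: 0.04457 s, median throughput:     22.44 token/s
"""

MEDIAN_RE = re.compile(
    r"Decode\.\s+median latency:\s*([\d.]+) s, median throughput:\s*([\d.]+) token/s"
)


def benchmark_decode_medians(log: str) -> List[Tuple[float, float]]:
    """Return (latency_s, throughput_tok_s) per run, skipping warmup phases."""
    results: List[Tuple[float, float]] = []
    in_benchmark = False
    for raw in log.splitlines():
        line = raw.strip()
        if line.startswith("Warmup"):
            in_benchmark = False
        elif line.startswith("Benchmark"):
            in_benchmark = True
        elif in_benchmark:
            match = MEDIAN_RE.search(line)
            if match:
                results.append((float(match.group(1)), float(match.group(2))))
    return results


def relative_throughput(medians: List[Tuple[float, float]]) -> List[float]:
    """Throughput of each run divided by the throughput of the first run."""
    base = medians[0][1]
    return [tput / base for _, tput in medians]


if __name__ == "__main__":
    medians = benchmark_decode_medians(LOG)
    ratios = relative_throughput(medians)
    for i, ((lat, tput), ratio) in enumerate(zip(medians, ratios), start=1):
        print(f"run {i}: {lat * 1000:.2f} ms/token, {tput:.2f} token/s, {ratio:.2f}x vs run 1")
```

With the numbers above, the second run decodes at about 1.83x the first run's median throughput and the third at about 0.30x.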
    
    jerryzh168 committed Sep 6, 2024
    0b095c8

Commits on Sep 7, 2024

  1. add torchao-config

    jerryzh168 committed Sep 7, 2024
    b3696b5
  2. add fp8

    jerryzh168 committed Sep 7, 2024
    281c19b

Commits on Sep 9, 2024

  1. 01cc04e
  2. lint

    merrymercy committed Sep 9, 2024
    6712230
  3. 8ae193b
  4. update

    merrymercy committed Sep 9, 2024
    69d2bb5
  5. update test cases

    merrymercy committed Sep 9, 2024
    b76dc72