Support double sparsity #1459

Merged (6 commits into sgl-project:main, Oct 14, 2024)
Conversation

andy-yang-1 (Contributor) commented Sep 18, 2024

Motivation

  • Support double sparsity (post-training sparse attention) for long context inference in SGLang
  • See paper

Modifications

  • Add a Triton implementation in sglang/python/sglang/srt/layers/sparse_decode_attention.py (a rough sketch of the decode idea appears right after this list)
  • Add the serving-related parts (server arguments, memory pool, and attention-backend integration)
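
For intuition, here is a minimal PyTorch sketch of the decode-time idea: approximate the attention scores using only a few heavy (outlier) channels, then run exact attention over just the top heavy tokens. This is an illustration only, not the PR's Triton kernel; the function name, tensor layouts, and single-sequence setup are assumptions, and details such as batching, GQA, and the paged KV-cache layout are omitted.

import torch

def double_sparsity_decode(q, k_cache, v_cache, heavy_channels, heavy_token_num):
    """q: [num_heads, head_dim]; k_cache/v_cache: [seq_len, num_heads, head_dim];
    heavy_channels: int64 [num_heads, heavy_channel_num] indices of outlier channels."""
    num_heads, head_dim = q.shape
    seq_len = k_cache.shape[0]

    # 1) Approximate attention scores using only the heavy channels.
    q_label = torch.gather(q, 1, heavy_channels)                          # [H, C]
    k_label = torch.gather(
        k_cache, 2, heavy_channels.expand(seq_len, -1, -1)
    )                                                                      # [S, H, C]
    approx_scores = torch.einsum("hc,shc->hs", q_label, k_label)          # [H, S]

    # 2) Keep only the top heavy_token_num tokens per head.
    topk = min(heavy_token_num, seq_len)
    idx = approx_scores.topk(topk, dim=-1).indices                        # [H, topk]

    # 3) Exact attention over the selected tokens only.
    head_ids = torch.arange(num_heads)[:, None]
    k_sel = k_cache.permute(1, 0, 2)[head_ids, idx]                       # [H, topk, D]
    v_sel = v_cache.permute(1, 0, 2)[head_ids, idx]                       # [H, topk, D]
    scores = torch.einsum("hd,htd->ht", q, k_sel) / head_dim**0.5
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("ht,htd->hd", probs, v_sel)                       # [H, D]

In this sketch, heavy_channels.shape[-1] and heavy_token_num play the roles of the --ds-heavy-channel-num and --ds-heavy-token-num flags used in the commands below.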

Speedup Evaluation

Run double sparsity with:

python -m sglang.bench_latency --model-path lmsys/longchat-7b-v1.5-32k \
    --attention-backend triton --disable-cuda-graph \
    --ds-channel-config-path /path/to/lmsys/longchat-7b-v1.5-32k.json \
    --input-len 20000 --output-len 200 \
    --batch-size 3 \
    --enable-double-sparsity \
    --ds-heavy-channel-num 16 \
    --ds-heavy-token-num 1024 \
    --ds-sparse-decode-threshold 0 \
    --max-total-tokens 70000

Benchmark ...
Prefill. latency: 7.83636 s, throughput:   7656.62 token/s
Decode.  latency: 0.02351 s, throughput:    127.58 token/s
Decode.  latency: 0.02124 s, throughput:    141.22 token/s
Decode.  latency: 0.02037 s, throughput:    147.26 token/s
Decode.  latency: 0.01950 s, throughput:    153.81 token/s
Decode.  latency: 0.01935 s, throughput:    155.04 token/s
Decode.  median latency: 0.01923 s, median throughput:    156.04 token/s
Total. latency: 11.821 s, throughput:   5126.36 token/s

Original Triton implementation:

python -m sglang.bench_latency --model-path lmsys/longchat-7b-v1.5-32k \
    --attention-backend triton \
    --input-len 20000 --output-len 200 \
    --batch-size 3

Benchmark ...
Prefill. latency: 7.79627 s, throughput:   7695.98 token/s
Decode.  latency: 0.07196 s, throughput:     41.69 token/s
Decode.  latency: 0.06514 s, throughput:     46.05 token/s
Decode.  latency: 0.06475 s, throughput:     46.33 token/s
Decode.  latency: 0.06463 s, throughput:     46.41 token/s
Decode.  latency: 0.06457 s, throughput:     46.46 token/s
Decode.  median latency: 0.06487 s, median throughput:     46.25 token/s
Total. latency: 20.720 s, throughput:   2924.74 token/s

Original FlashInfer implementation:

python -m sglang.bench_latency --model-path lmsys/longchat-7b-v1.5-32k \
    --attention-backend flashinfer \
    --input-len 20000 --output-len 200 \
    --batch-size 3

Benchmark ...
Prefill. latency: 5.68892 s, throughput:  10546.83 token/s
Decode.  latency: 0.03240 s, throughput:     92.60 token/s
Decode.  latency: 0.02993 s, throughput:    100.23 token/s
Decode.  latency: 0.02970 s, throughput:    101.01 token/s
Decode.  latency: 0.02959 s, throughput:    101.39 token/s
Decode.  latency: 0.02959 s, throughput:    101.38 token/s
Decode.  median latency: 0.02961 s, median throughput:    101.32 token/s
Total. latency: 11.585 s, throughput:   5231.00 token/s

With Llama-3.1-8B:

# Double Sparsity
python -m sglang.bench_latency --model-path meta-llama/Llama-3.1-8B-Instruct \
    --attention-backend triton \
    --ds-channel-config-path /path/to/meta-llama/Llama-3.1-8B-Instruct.json \
    --input-len 60000 --output-len 200 \
    --batch-size 3 \
    --enable-double-sparsity \
    --ds-heavy-channel-num 32 \
    --ds-heavy-channel-type k \
    --ds-heavy-token-num 3000 \
    --ds-sparse-decode-threshold 0 \
    --max-total-tokens 200000

Benchmark ...
Prefill. latency: 42.96801 s, throughput:   4189.16 token/s
Decode.  latency: 0.02843 s, throughput:    105.50 token/s
Decode.  latency: 0.02518 s, throughput:    119.16 token/s
Decode.  latency: 0.02465 s, throughput:    121.72 token/s
Decode.  latency: 0.02442 s, throughput:    122.84 token/s
Decode.  latency: 0.02434 s, throughput:    123.24 token/s
Decode.  median latency: 0.02421 s, median throughput:    123.90 token/s
Total. latency: 47.793 s, throughput:   3778.77 token/s

# Triton
python -m sglang.bench_latency --model-path meta-llama/Llama-3.1-8B-Instruct \
    --attention-backend triton \
    --input-len 60000 --output-len 200 \
    --batch-size 3 \
    --max-total-tokens 200000

Benchmark ...
Prefill. latency: 43.17160 s, throughput:   4169.41 token/s
Decode.  latency: 0.06359 s, throughput:     47.18 token/s
Decode.  latency: 0.05965 s, throughput:     50.30 token/s
Decode.  latency: 0.05927 s, throughput:     50.62 token/s
Decode.  latency: 0.05906 s, throughput:     50.80 token/s
Decode.  latency: 0.05906 s, throughput:     50.80 token/s
Decode.  median latency: 0.05913 s, median throughput:     50.73 token/s
Total. latency: 54.950 s, throughput:   3286.63 token/s

# Flashinfer
python -m sglang.bench_latency --model-path meta-llama/Llama-3.1-8B-Instruct \
    --attention-backend flashinfer \
    --input-len 60000 --output-len 200 \
    --batch-size 3 \
    --max-total-tokens 200000

Benchmark ...
Prefill. latency: 27.50800 s, throughput:   6543.55 token/s
Decode.  latency: 0.03014 s, throughput:     99.54 token/s
Decode.  latency: 0.02834 s, throughput:    105.86 token/s
Decode.  latency: 0.02821 s, throughput:    106.36 token/s
Decode.  latency: 0.02819 s, throughput:    106.41 token/s
Decode.  latency: 0.02823 s, throughput:    106.28 token/s
Decode.  median latency: 0.02821 s, median throughput:    106.34 token/s
Total. latency: 33.125 s, throughput:   5452.12 token/s

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

merrymercy (Contributor) commented Sep 19, 2024

Great work. Some tips for rebasing:

Review threads (outdated, resolved) on: python/sglang/srt/layers/radix_attention.py, python/sglang/srt/layers/test_ds_kernel.py, python/sglang/srt/mem_cache/memory_pool.py
Ying1123 mentioned this pull request on Sep 22, 2024
merrymercy mentioned this pull request on Sep 22, 2024
ghost commented Sep 24, 2024

Quick question @andy-yang-1 - Does this PR support just Double Sparsity or DS-Offload as well?

andy-yang-1 (Contributor, Author) commented:

@vnkc1 Hi, this PR doesn't support DS-Offload for now. DS-Offload may be integrated in another PR if needed.

fengyang95 commented:
Is there a plan to merge this PR?

merrymercy (Contributor) commented Oct 11, 2024

Yes. It should be merged within one week.
@andy-yang-1 please:

  1. Resolve the conflicts.
  2. Add an end-to-end accuracy unit test.

merrymercy (Contributor) commented:

Please fix the lint error and add an end-to-end accuracy test

Review threads on: python/sglang/srt/model_executor/forward_batch_info.py (outdated, resolved), python/sglang/test/Llama-3.1-8B-Instruct.jsonconfig (outdated, resolved), test/srt/test_double_sparsity.py (resolved)
merrymercy changed the title from "[WIP] Support double sparsity" to "Support double sparsity" on Oct 14, 2024
merrymercy (Contributor) commented Oct 14, 2024

Give two example commands and paste their results in the description of this PR. This is for tracking progress. It should be something like this:

# baseline
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 1 --input 1024 --output 8

# double sparsity
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 1 --input 1024 --output 8 --enable-double-sparsity ...

merrymercy (Contributor) commented:

@andy-yang-1 Can you also paste the latency results?

Review threads (outdated, resolved) on: test/srt/test_double_sparsity.py, test/srt/run_suite.py
merrymercy enabled auto-merge (squash) on October 14, 2024 08:32
merrymercy disabled auto-merge on October 14, 2024 09:00
merrymercy merged commit 061e546 into sgl-project:main on Oct 14, 2024 (10 of 11 checks passed)
merrymercy (Contributor) commented:

@andy-yang-1 Thanks for the contribution. It is merged.

max99x (Contributor) commented Oct 14, 2024

How does one generate the ds-channel-config to be able to use this?

fengyang95 commented:
I noticed that CUDA graph is not currently supported. Are there any plans to support it? @andy-yang-1

andy-yang-1 (Contributor, Author) commented:

@max99x You can use this link to generate the channel config file.

@fengyang95 We may support it in the next PR.
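
As a quick sanity check once a config has been generated, it can help to load it and look at the keys before passing it via --ds-channel-config-path. A minimal sketch; the file path is hypothetical, and the key naming follows the discussion later in this thread:

import json

# Hypothetical path to a channel config produced by the DoubleSparse calibration scripts.
config_path = "/path/to/meta-llama/Llama-3.1-8B-Instruct.json"

with open(config_path) as f:
    channel_config = json.load(f)

# Expected keys look like "model.layers.0.self_attn.q_proj", "...k_proj", "...qk_proj".
print(len(channel_config), "entries")
print(sorted(channel_config.keys())[:3])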

fengyang95 commented Oct 18, 2024

Hi @andy-yang-1, does this support the deepseek-v2 architecture? How can I obtain the config for this architecture? I see that the example at https://github.com/andy-yang-1/DoubleSparse/blob/main/evaluation/group_channel_config.py only supports the llama/mixtral architectures.

fengyang95 commented Oct 19, 2024

@andy-yang-1 I tried running the deepseek-v2 model, but encountered the following issue:

File "/opt/tiger/custome_sglang/python/sglang/srt/layers/attention/double_sparsity_backend.py", line 162, in forward_extend
    k_label = torch.gather(
              ^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument index in method wrapper_CUDA_gather)
  File "/opt/tiger/custome_sglang/python/sglang/srt/layers/attention/__init__.py", line 49, in forward
    return self.forward_extend(q, k, v, layer, forward_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/custome_sglang/python/sglang/srt/layers/attention/double_sparsity_backend.py", line 162, in forward_extend
    k_label = torch.gather(
              ^^^^^^^^^^^^^
RuntimeError: Size does not match at dimension 1 expected index [7, 128, 16] to be smaller than self [7, 1, 576] apart from dimension 2

andy-yang-1 (Contributor, Author) commented:

@fengyang95 I haven't added support for the deepseek-v2 model. I may add support for it later.

fengyang95 commented:
> @fengyang95 I haven't added support for deepseek-v2 model. I may add support for this later

@andy-yang-1 Thank you very much! Looking forward to support for deepseek-v2 and CUDA graphs.

shreyansh26 commented:
@andy-yang-1 - Loved the paper! I was trying this out and I am facing a few issues generating the config file using the mentioned script.

  1. The line cos, sin = m.rotary_emb(v, seq_len=kv_seq_len) in stat_qk_max_hook of get_calib_qk_feat gives an error:
TypeError: LlamaRotaryEmbedding got an unexpected keyword argument 'seq_len'

I replaced it with cos, sin = m.rotary_emb(v, position_ids=position_ids), which works. I'm not sure whether that is correct, but LlamaRotaryEmbedding indeed doesn't have the seq_len parameter.

  2. In the config file that gets generated, I only get keys of the form model.layers.{layer_num}.self_attn, but the config file present in the test folder has keys of the form model.layers.{layer_num}.self_attn.q_proj, model.layers.{layer_num}.self_attn.k_proj, and model.layers.{layer_num}.self_attn.qk_proj. How were these generated?
    On using my generated config with sglang, I get an error of the type: Key model.layers.0.self_attn.k_proj was not found.

Any help on how to run this would be appreciated.

andy-yang-1 (Contributor, Author) commented:

@shreyansh26 The first problem is caused by the transformers version (the calibration code targets the older rotary-embedding API), and I will update the base repo to fix it this week.
The q_outlier_config/k_outlier_config is generated with the get_calib_feat function, and the qk_outlier_config is generated with the get_qk_calib_feat function. You can merge these two configs together to get all the keys. I will also update it this week.
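
A minimal sketch of that merge step, assuming each calibration function dumps a flat JSON dict; the file names below are placeholders, not the repo's actual outputs:

import json

# Placeholder file names for the two calibration outputs.
with open("qk_separate_outlier_config.json") as f:   # from get_calib_feat (q_proj / k_proj keys)
    separate_cfg = json.load(f)
with open("qk_joint_outlier_config.json") as f:      # from get_qk_calib_feat (qk_proj keys)
    joint_cfg = json.load(f)

# The key sets should be disjoint, so a plain dict merge suffices.
merged = {**separate_cfg, **joint_cfg}

with open("channel_config.json", "w") as f:
    json.dump(merged, f)   # pass this file via --ds-channel-config-path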

shreyansh26 commented Nov 7, 2024

Thank you.
There may be another discrepancy: in get_calib_feat, the following condition filters out k_proj because of GQA.

if y.shape[-1] != model.config.hidden_size:
    return

But in the Llama-3.1-8B-Instruct config file, k_proj keys are also present.
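
For context on why that check drops k_proj under GQA: with grouped-query attention, k_proj and v_proj project to num_key_value_heads * head_dim rather than hidden_size (for Llama-3.1-8B that is 8 * 128 = 1024 vs. 4096), so the y.shape[-1] != hidden_size test skips them. A GQA-aware check might look like the sketch below; this is illustrative, not the repo's fix:

def keep_projection_output(y, model):
    # Accept full-width projections (q_proj/o_proj) as well as the narrower
    # GQA projections (k_proj/v_proj), whose width is num_kv_heads * head_dim.
    cfg = model.config
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    kv_width = getattr(cfg, "num_key_value_heads", cfg.num_attention_heads) * head_dim
    return y.shape[-1] in (cfg.hidden_size, kv_width)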

andy-yang-1 (Contributor, Author) commented:

@shreyansh26 Hi, I have updated the main repo. Can you try it with the new code?

shreyansh26 commented:
Thank you @andy-yang-1!! This is working perfectly now.

yuguo-Jack commented:
> @vnkc1 Hi, this PR doesn't support DS-Offload for now. DS-Offload may be integrated in other PR if needed.

Is there a plan to support DS-Offload in SGLang?

hcyz33 commented Jan 13, 2025

(The comment quotes the full PR description and benchmark results from above.)

I found that prefill throughput is lower when DS attention is enabled (from 6543.55 to 4189.16 token/s). The likely reason is that the Triton attention backend is used. Is it possible to use FlashInfer attention for the prefill to increase prefill throughput?
