Optimize Triton decoding kernel for long context #2394

ispobock · 2024-12-08T08:51:55Z

Motivation

As mentioned in #2271, the original Triton decoding kernel slows down significantly on long contexts. This PR refactors the kernel and adapts the flash decoding implementation from lightllm. With this change, the throughput decay at long context is much smaller, as the benchmark below shows.
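For readers unfamiliar with flash decoding, here is a minimal pure-Python sketch of the split-KV idea for a single query and a single head. It is not the Triton kernel from this PR: the function names `attend_chunk` and `attend_split` are hypothetical, the softmax scale is omitted, and everything runs sequentially. It only illustrates that attention can be computed per KV chunk and then merged exactly using each chunk's log-sum-exp.

```python
import math

def attend_chunk(q, keys, values):
    """Softmax attention over one chunk; returns (output, log-sum-exp of scores)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)

def attend_split(q, keys, values, num_kv_splits):
    """Stage 1: attend to each KV chunk independently. Stage 2: merge the partials."""
    chunk = math.ceil(len(keys) / num_kv_splits)
    partials = [
        attend_chunk(q, keys[s:s + chunk], values[s:s + chunk])
        for s in range(0, len(keys), chunk)
    ]
    # Each partial output is reweighted by exp(lse_i) / sum_j exp(lse_j).
    lse_max = max(lse for _, lse in partials)
    weights = [math.exp(lse - lse_max) for _, lse in partials]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * out[d] for w, (out, _) in zip(weights, partials)) / total
        for d in range(dim)
    ]
```

In a GPU kernel the chunks of stage 1 can be processed in parallel, which is presumably why more splits help when a single request has a long KV cache.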

Benchmark

Tested with input length 128 and output length 2048.

Triton (this PR) num_kv_splits=8: 150->138

$ python3 -m sglang.bench_offline_throughput --model meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompt 1 --random-input 128 --random-output 2048 --random-range 1 --attention-backend triton

[2024-12-08 08:37:24 TP0] Decode batch. #running-req: 1, #token: 194, token usage: 0.00, gen throughput (token/s): 150.66, #queue-req: 0

[2024-12-08 08:37:37 TP0] Decode batch. #running-req: 1, #token: 2154, token usage: 0.00, gen throughput (token/s): 138.86, #queue-req: 0

Increasing --triton-attention-num-kv-splits gives better performance on long context.

Triton (this PR) num_kv_splits=16: 150->144

$ python3 -m sglang.bench_offline_throughput --model meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompt 1 --random-input 128 --random-output 2048 --random-range 1 --attention-backend triton --triton-attention-num-kv-splits 16

[2024-12-08 08:40:28 TP0] Decode batch. #running-req: 1, #token: 194, token usage: 0.00, gen throughput (token/s): 150.18, #queue-req: 0

[2024-12-08 08:40:42 TP0] Decode batch. #running-req: 1, #token: 2154, token usage: 0.00, gen throughput (token/s): 144.00, #queue-req: 0

Triton (main branch): 147->126

$ python3 -m sglang.bench_offline_throughput --model meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompt 1 --random-input 128 --random-output 2048 --random-range 1 --attention-backend triton

[2024-12-08 08:35:01 TP0] Decode batch. #running-req: 1, #token: 194, token usage: 0.00, gen throughput (token/s): 147.93, #queue-req: 0

[2024-12-08 08:35:15 TP0] Decode batch. #running-req: 1, #token: 2154, token usage: 0.00, gen throughput (token/s): 126.67, #queue-req: 0

Flashinfer: 143->143

$ python3 -m sglang.bench_offline_throughput --model meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompt 1 --random-input 128 --random-output 2048 --random-range 1

[2024-12-08 08:43:15 TP0] Decode batch. #running-req: 1, #token: 194, token usage: 0.00, gen throughput (token/s): 143.84, #queue-req: 0

[2024-12-08 08:43:29 TP0] Decode batch. #running-req: 1, #token: 2154, token usage: 0.00, gen throughput (token/s): 143.24, #queue-req: 0
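To put the four runs on one scale, the relative throughput drop between the first and last logged decode batch can be computed directly from the log lines above. The snippet below is a small sketch that uses only those logged numbers; the dictionary labels and the `relative_drop` helper are illustrative names.

```python
# Throughput (token/s) at #token 194 and at #token 2154, copied from the logs above.
runs = {
    "triton (this PR), num_kv_splits=8": (150.66, 138.86),
    "triton (this PR), num_kv_splits=16": (150.18, 144.00),
    "triton (main branch)": (147.93, 126.67),
    "flashinfer": (143.84, 143.24),
}

def relative_drop(start, end):
    """Percentage of throughput lost between the short and the long context."""
    return (start - end) / start * 100.0

drops = {name: relative_drop(start, end) for name, (start, end) in runs.items()}
for name, drop in drops.items():
    print(f"{name}: {drop:.1f}% lower throughput at the longer context")
```

These are single-request measurements from one run each, so small differences should be read with caution.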

merrymercy · 2024-12-08T08:55:07Z

python/sglang/srt/layers/attention/triton_ops/decode_attention.py

@@ -705,10 +650,10 @@ def decode_attention_fwd(
            o,
            req_to_token,
            b_req_idx,
-            b_start_loc,


Remove this from the function signature of decode_attention_fwd as well?

Sure. max_len_in_batch and triton_attention_reduce_in_fp32 may also need to be removed.

merrymercy · 2024-12-08T08:55:56Z

python/sglang/srt/layers/attention/triton_backend.py

+                    forward_batch.batch_size,
+                    self.num_head,
+                    self.num_kv_splits,
+                    self.v_head_dim + 1,


After this change, do we still need to reduce the CUDA graph max batch size for DeepSeek models?

Let me verify it.
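As background for the question above: the shape in the diff, (batch_size, num_head, num_kv_splits, v_head_dim + 1), has no dimension that depends on the number of tokens. A rough sketch of the resulting buffer size follows. The helper name `split_buffer_bytes`, the example sizes, and the 4-byte (float32) element size are all assumptions for illustration, not values taken from the PR.

```python
def split_buffer_bytes(batch_size, num_head, num_kv_splits, v_head_dim, bytes_per_elem=4):
    """Bytes needed for a buffer shaped (batch_size, num_head, num_kv_splits, v_head_dim + 1).

    bytes_per_elem=4 assumes float32 storage, which is an assumption here.
    """
    return batch_size * num_head * num_kv_splits * (v_head_dim + 1) * bytes_per_elem

# Hypothetical sizes, chosen only to show how the buffer scales.
small = split_buffer_bytes(batch_size=1, num_head=32, num_kv_splits=8, v_head_dim=128)
large = split_buffer_bytes(batch_size=64, num_head=32, num_kv_splits=8, v_head_dim=128)
print(small, large)
```

Under these assumptions the buffer grows linearly with batch size and with num_kv_splits, but not with context length.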

ispobock added 5 commits December 7, 2024 16:18

flash decoding draft

c35838d

fix acc

e9e7267

add args

1223a55

fix acc bug

dd4baa4

update ref

35356f0

ispobock requested review from merrymercy, Ying1123, zhyncs, hnyls2002 and ByronHsu as code owners December 8, 2024 08:51

Merge branch 'main' into flash-decoding

d42c098

merrymercy reviewed Dec 8, 2024

merrymercy approved these changes Dec 8, 2024

Merge branch 'main' into flash-decoding

1be3d36

merrymercy merged commit 7dc66fc into sgl-project:main Dec 8, 2024
0 of 14 checks passed
