[Kernel] add triton fused moe kernel for gptq/awq #12185

jinzhen-lin · 2025-01-18T11:21:09Z

Currently, the only option for running MoE models with GPTQ/AWQ is the Marlin kernel. However, a single marlin_gemm_moe call launches at least num_experts CUDA kernels, whereas the fused_moe Triton kernel launches only one. This makes the Marlin kernel significantly slower than the fused_moe Triton kernel.

This PR adds support for fused_moe triton kernel with gptq/awq.
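The launch-count argument above can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions: it uses top-1 routing, plain lists instead of quantized tensors, and a counter standing in for CUDA kernel launches. The function names `per_expert_moe` and `fused_moe` are hypothetical and are not the vLLM or Marlin APIs.

```python
def matvec(weight, x):
    """Multiply a (rows x cols) weight matrix by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def per_expert_moe(tokens, expert_ids, expert_weights):
    """Model of a per-expert loop: one simulated launch per expert."""
    launches = 0
    outputs = [None] * len(tokens)
    for e, weight in enumerate(expert_weights):
        launches += 1  # counted even when no token is routed to expert e
        for t, x in enumerate(tokens):
            if expert_ids[t] == e:
                outputs[t] = matvec(weight, x)
    return outputs, launches


def fused_moe(tokens, expert_ids, expert_weights):
    """Model of a fused pass: one simulated launch covers every expert."""
    launches = 1
    # Group tokens by expert so each expert's weights are visited contiguously.
    order = sorted(range(len(tokens)), key=lambda t: expert_ids[t])
    outputs = [None] * len(tokens)
    for t in order:
        outputs[t] = matvec(expert_weights[expert_ids[t]], tokens[t])
    return outputs, launches


# Three tokens routed (top-1, an assumption) across three toy experts.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
expert_ids = [2, 0, 2]
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: scale by 2 (receives no tokens)
    [[0.0, 1.0], [1.0, 0.0]],  # expert 2: swap the two components
]

loop_out, loop_launches = per_expert_moe(tokens, expert_ids, expert_weights)
fused_out, fused_launches = fused_moe(tokens, expert_ids, expert_weights)
```

Both paths compute the same outputs; only the number of simulated launches differs (3 versus 1 here). With many experts and a batch size of 1, the per-launch overhead is what the description identifies as the bottleneck.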

Generation speed of deepseek-v3-awq (8*A100-SXM4-80GB, bs=1, short prompt)

Configuration	marlin moe kernel	triton fused moe kernel
w/o cuda graph	5.4 tok/s	10.0 tok/s
w/ cuda graph	10.5 tok/s	25.3 tok/s
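For reference, the relative speedups implied by the table can be worked out directly. The snippet below only restates the four figures reported above; the dictionary layout and the `speedup` helper are illustrative choices.

```python
# Throughput figures (tok/s) copied from the table above.
throughput = {
    "w/o cuda graph": {"marlin": 5.4, "triton": 10.0},
    "w/ cuda graph": {"marlin": 10.5, "triton": 25.3},
}


def speedup(row):
    """Ratio of Triton fused MoE throughput to Marlin MoE throughput."""
    return row["triton"] / row["marlin"]


speedups = {name: round(speedup(row), 2) for name, row in throughput.items()}
print(speedups)  # {'w/o cuda graph': 1.85, 'w/ cuda graph': 2.41}
```

So on this setup the Triton kernel is roughly 1.85x faster without CUDA graph and roughly 2.41x faster with it.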

Note:

To enable CUDA graph, apply PR #12036 ([Bugfix] make moe_align_block_size compliable with cuda graph).
To enable this kernel, pass --quantization moe_quant_int when launching the server:

python -m vllm.entrypoints.openai.api_server \
    --served-model-name model \
    --model cognitivecomputations/DeepSeek-V3-AWQ \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --max-model-len 24576 \
    --dtype half \
    --max-num-seqs 16 \
    --gpu-memory-utilization 0.96 \
    --quantization moe_quant_int
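Once the server started by the command above is running, a quick way to try it is a request to the OpenAI-compatible completions endpoint. The sketch below assumes the server listens on localhost at vLLM's default port 8000; the model name "model" comes from --served-model-name in the command. The helper names `build_completion_request` and `send` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default host and port


def build_completion_request(prompt, max_tokens=32):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": "model",  # matches --served-model-name in the launch command
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the generated text (needs a live server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["text"]


# Usage, with the server running:
#   print(send(build_completion_request("Hello, my name is")))
```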

github-actions · 2025-01-18T11:21:20Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

Signed-off-by: Jinzhen Lin <[email protected]>

jinzhen-lin requested review from tlrmchlsmth and WoosukKwon as code owners January 18, 2025 11:21

jinzhen-lin added 7 commits January 18, 2025 20:07

add moe_quant_int quantization method

2f3ed3b

Signed-off-by: Jinzhen Lin <[email protected]>

use tl.float32 to dequantize

97f18ef

Signed-off-by: Jinzhen Lin <[email protected]>

fix format error

0530452

Signed-off-by: Jinzhen Lin <[email protected]>

fix format error

fb7bba5

Signed-off-by: Jinzhen Lin <[email protected]>

fix format error

4bd2c31

Signed-off-by: Jinzhen Lin <[email protected]>

fix format error

ac8ae24

Signed-off-by: Jinzhen Lin <[email protected]>

fix format error

87e191f

Signed-off-by: Jinzhen Lin <[email protected]>

jinzhen-lin force-pushed the triton_fused_moe_int4 branch from 91b41c6 to 87e191f Compare January 18, 2025 12:08

jinzhen-lin added 2 commits January 18, 2025 20:11

fix format error

29df4d0

Signed-off-by: Jinzhen Lin <[email protected]>

fix format error

99f23f2

Signed-off-by: Jinzhen Lin <[email protected]>

jinzhen-lin force-pushed the triton_fused_moe_int4 branch from 21c1d8d to 99f23f2 Compare January 18, 2025 12:14

mgoin self-requested a review January 18, 2025 21:47

fix error

15ae02b

Signed-off-by: Jinzhen Lin <[email protected]>

jinzhen-lin force-pushed the triton_fused_moe_int4 branch from 55102d9 to 15ae02b Compare January 19, 2025 06:44
