
[Kernel] add triton fused moe kernel for gptq/awq #12185

Open · wants to merge 10 commits into main
Conversation

jinzhen-lin
Contributor

@jinzhen-lin jinzhen-lin commented Jan 18, 2025

Currently, the only option for MoE + GPTQ/AWQ is the Marlin kernel. However, a single marlin_gemm_moe call launches at least num_experts CUDA kernels, while the fused_moe Triton kernel needs only one kernel launch. This makes the Marlin kernel significantly slower than the fused_moe Triton kernel.

This PR adds support for fused_moe triton kernel with gptq/awq.
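The launch-count difference described above can be sketched in a few lines. This is an illustrative NumPy model, not vLLM code: the names and sizes (`num_experts`, `topk_ids`, etc.) are made up, each loop iteration stands in for one CUDA kernel launch on the Marlin path, and the single gathered `einsum` stands in for the one fused Triton launch.

```python
import numpy as np

# Illustrative sketch (not vLLM internals): num_experts separate GEMMs
# (one "kernel launch" per expert, like the Marlin path) vs. a single
# gathered computation (like the fused Triton kernel).
num_tokens, hidden, inter, num_experts = 8, 16, 32, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((num_tokens, hidden))            # token activations
w = rng.standard_normal((num_experts, hidden, inter))    # per-expert weights
topk_ids = rng.integers(0, num_experts, size=num_tokens) # top-1 routing

# Per-expert path: num_experts launches, each often covering few tokens.
out_loop = np.zeros((num_tokens, inter))
for e in range(num_experts):
    mask = topk_ids == e
    out_loop[mask] = x[mask] @ w[e]

# Fused path: gather each token's expert weight, one batched contraction.
out_fused = np.einsum("th,thi->ti", x, w[topk_ids])

assert np.allclose(out_loop, out_fused)
```

Both paths compute the same result; the fused path just avoids paying per-expert launch overhead, which matters most at small batch sizes where each expert sees only a handful of tokens.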

Generation speed of deepseek-v3-awq (8×A100-SXM4-80GB, bs=1, short prompt):

                    Marlin MoE kernel    Triton fused MoE kernel
    w/o CUDA graph  5.4 tok/s            10.0 tok/s
    w/ CUDA graph   10.5 tok/s           25.3 tok/s

Note:

  1. To enable CUDA graph, use this PR: [Bugfix] make moe_align_block_size compliable with cuda graph #12036
  2. To enable this kernel, launch the server with `--quantization moe_quant_int`:
python -m vllm.entrypoints.openai.api_server \
    --served-model-name model \
    --model cognitivecomputations/DeepSeek-V3-AWQ \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --max-model-len 24576 \
    --dtype half \
    --max-num-seqs 16 \
    --gpu-memory-utilization 0.96 \
    --quantization moe_quant_int


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

Signed-off-by: Jinzhen Lin <[email protected]>
@jinzhen-lin jinzhen-lin force-pushed the triton_fused_moe_int4 branch from 91b41c6 to 87e191f Compare January 18, 2025 12:08
@jinzhen-lin jinzhen-lin force-pushed the triton_fused_moe_int4 branch from 21c1d8d to 99f23f2 Compare January 18, 2025 12:14
@mgoin mgoin self-requested a review January 18, 2025 21:47
@jinzhen-lin jinzhen-lin force-pushed the triton_fused_moe_int4 branch from 55102d9 to 15ae02b Compare January 19, 2025 06:44