MoE Expert Parallel Impl #2203
Conversation
It's amazing! Please join our Slack channel https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2rtikx2pv-DUfPrhx2SaNAq~47YtV1XQ
We currently have a kernel library, and I can add you to the team for collaboration.
I'm in the Slack channel.
Not sure why one of the unit tests failed. It seems unrelated to this PR.
@xiaobochen123 Does this rely on DP attention? Also, from what I understand, EP shouldn't add any extra memory burden, right?
@xiaobochen123 Could you highlight how TP is used along with EP within the MoE layer here?
Amazing work, @xiaobochen123! This is super dope.
@fengyang95 1. It does not rely on DP attention. 2. It does not add extra memory.
@HaiShaw When EP is enabled, the MoE module uses EP with no TP (EP size = TP size), while attention remains TP. Regarding the imbalance issue, you can analyze performance with nsys: if the MoE execution time differs across GPUs, that reveals the imbalance. For more precise analysis, you can collect the top-k routing information for each layer.
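As a rough illustration of what collecting per-layer top-k information could look like (a hypothetical helper, not part of this PR), one can histogram the router's expert assignments and compute a max-over-mean imbalance factor:

```python
# Hypothetical sketch: measure expert load balance from the router's top-k output.
import torch

def expert_load_histogram(topk_ids: torch.Tensor, num_experts: int) -> torch.Tensor:
    """topk_ids: [num_tokens, top_k] expert indices; returns per-expert token-slot counts."""
    return torch.bincount(topk_ids.flatten(), minlength=num_experts)

def imbalance_factor(load: torch.Tensor) -> float:
    """Max-over-mean load; 1.0 means perfectly balanced experts."""
    return (load.max().float() / load.float().mean().clamp(min=1.0)).item()

# Example with random routing over 160 experts and top_k = 6:
topk_ids = torch.randint(0, 160, (4096, 6))
load = expert_load_histogram(topk_ids, num_experts=160)
print(load, imbalance_factor(load))
```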
@xiaobochen123 Thanks for the awesome contribution! Overall LGTM. Left some comments for minor changes.
BLOCK_SIZE_M = 128
BLOCK_SIZE_N = 128
BLOCK_SIZE_K = 128
Can we make them kwargs, so they're tunable?
There is room for improvement. I am not very familiar with Triton. Is there a reference demo for Triton tuning?
Most kwargs, as in https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/layers/fused_moe_triton/fused_moe.py, can be tuned.
For the tuning, we can refer to https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
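For reference, a minimal sketch of what making the block sizes tunable could look like with triton.autotune (illustrative kernel name and configs; the real tuned values would come from the benchmark scripts above):

```python
# Sketch only: expose BLOCK_SIZE_* as constexpr arguments and let Triton pick a config.
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 64}, num_warps=4),
        triton.Config({"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128}, num_warps=8),
    ],
    key=["M", "N", "K"],  # re-select a config when the problem shape changes
)
@triton.jit
def grouped_gemm_kernel(  # illustrative name, not the kernel in this PR
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    BLOCK_SIZE_M: tl.constexpr,
    BLOCK_SIZE_N: tl.constexpr,
    BLOCK_SIZE_K: tl.constexpr,
):
    # ... kernel body unchanged; the block sizes are now compile-time parameters ...
    pass
```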
In the MoE-TP Triton kernel, the best config is selected based on M, where M is hidden_states.size(0), which is a fixed value.
In MoE-EP, however, M can vary across layers or experts, which may make it harder to identify a configuration that works well for all cases.
This primarily concerns further performance optimization and doesn't affect the current functionality.
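One way to handle the varying M (a hypothetical helper, loosely mirroring how the TP fused_moe path looks up pre-tuned configs keyed by M) is to pick the tuned config whose M is closest to the runtime value:

```python
# Hypothetical sketch: nearest-M lookup into an offline-tuned config table.
def get_config_for_m(m: int, tuned_configs: dict) -> dict:
    """tuned_configs maps a benchmarked M -> kernel config (block sizes, num_warps, ...)."""
    nearest_m = min(tuned_configs.keys(), key=lambda key_m: abs(key_m - m))
    return tuned_configs[nearest_m]

# Example: configs tuned offline for a few representative M values.
tuned = {
    1: {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 128, "num_warps": 4},
    64: {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128, "num_warps": 8},
    2048: {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128, "num_warps": 8},
}
print(get_config_for_m(300, tuned))  # picks the config tuned for M=64
```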
Also, can we add use_fp8_w8a8 and other key flags as kwargs? Thanks!
@xiaobochen123, very nice work, and I look forward to more contributions from you!
Several general comments:
- Would you please add some more comments, for example to the new functions and/or kernels?
- Can we follow similar/closer naming conventions (variables, functions, kernels) to the existing fused_moe.py?
- What assumptions do we make about quantization (fp8, etc.) for EP support? E.g., should a TP-quantized model work seamlessly with EP?
- `--tp 8 --enable-ep-moe` seems confusing, as TP/EP don't coexist; is it possible to introduce `--ep <NUM>` or something similar?
Very nice work! Regarding the CUTLASS integration, we also opened a new folder to start integrating more GPU kernel code here: https://github.com/sgl-project/sglang/tree/main/sgl-kernel.
We plan to gradually add more kernels there, like custom allreduce and fp8 GEMM. You can also put the CUTLASS grouped GEMM there. It will be a separate Python package, but we use the sglang mono repo to host it so we can share CI resources and avoid syncing between two repos.
This is a very good contribution, so we would like to merge the Triton one as soon as possible. We can then do the CUTLASS integration in follow-up PRs.
to merge, thanks.
@HaiShaw I tested the model, and it's OK.
@xiaobochen123 Could you update the branch to prepare for the merge?
@HaiShaw Updated.
@xiaobochen123 Can you fix this CI error:
This PR is great, thanks @xiaobochen123! I am running a benchmark with your code without enabling DP attention, and I consistently see lower performance for EP than TP. Here are the benchmarking scripts:

model=/ssd/models/deepseek-ai/DeepSeek-Coder-V2-Instruct
# TP
python -m sglang.bench_offline_throughput \
--model-path $model \
--num-prompts 1000 \
--trust-remote-code \
--tp-size 8
# EP
python -m sglang.bench_offline_throughput \
--model-path $model \
--num-prompts 1000 \
--trust-remote-code \
--tp-size 8 \
--ep-size 8 \
    --enable-ep-moe

The output token throughputs on DGX-A100 are:
Does this align with your expectation given your above benchmarks? Thanks!
Minor comment: although the …
@Mutinifni Based on my tests, EP throughput is slightly lower than TP. While ep-size is enabled, tp-size is still required because attention relies on TP. I added ep-size based on dp-size, so their behaviors should be quite similar.
Co-authored-by: HAI <[email protected]>
If I set tp-size = 2, ep-size = 2, and --enable-ep-moe, is it EP only or hybrid parallelism now?
@sitabulaixizawaluduo It's hybrid parallelism: MoE uses EP, attention uses TP.
Thanks for the reply. I still have a question: is EP-only necessarily faster than TP + EP for the MLP computation?
@sitabulaixizawaluduo Sorry, I might not have fully understood your point. Why would TP + EP be used in an MLP? From my understanding, an MLP uses either TP or EP.
For example, a model with 8 experts runs on four GPUs with tp_size = 2 and ep_size = 2. In the MLP layer, GPU0 and GPU1 hold 4 experts, and those 4 experts do parallel computation with TP.
@sitabulaixizawaluduo TP requires one communication, and EP adds another. Could this result in degraded performance?
After TP, the amount of MLP computation on each GPU is reduced. In the worst case (the selected experts sit on different devices), the traffic is the same as TP; but if the selected experts are on the same device, the communication is faster, since GPU memory bandwidth > NVLink > PCIe.
@sitabulaixizawaluduo From my understanding, the overall computation should be the same for TP=4, EP=4, and TP=2/EP=2. It's unlikely that TP=2/EP=2 would have less computation. In my current implementation, MoE-EP uses AllReduce, so the communication cost is the same as TP.
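To make the communication argument concrete, here is a simplified dense reference of the EP forward path (illustrative code, not the fused Triton kernels in this PR; it assumes torch.distributed is already initialized and each rank holds only its own experts' weights):

```python
import torch
import torch.distributed as dist

def moe_ep_forward(x, topk_ids, topk_weights, local_experts):
    """x: [T, H]; topk_ids/topk_weights: [T, K]; local_experts: {expert_id: (w1, w2)} on this rank."""
    out = torch.zeros_like(x)
    for expert_id, (w1, w2) in local_experts.items():
        # Which tokens (and which of their k routing slots) picked this locally-owned expert.
        token_idx, slot_idx = (topk_ids == expert_id).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        h = torch.nn.functional.silu(x[token_idx] @ w1) @ w2
        out.index_add_(0, token_idx, h * topk_weights[token_idx, slot_idx].unsqueeze(-1))
    # Ranks contribute zeros for tokens whose experts live elsewhere, so one all-reduce
    # completes the sum -- the same collective a TP row-parallel MoE output already needs.
    dist.all_reduce(out)
    return out
```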
Thanks! I will recalculate the amount of forward communication and computation for this part. By the way, when I tested the response time of a single request (input length 1000-2000) on the L40 (PCIe), I found that TP is faster than EP. TP deployment: --tp-size 2; EP deployment: --tp-size 2 --enable-ep-moe. What is the reason for the performance gap between the two?
@sitabulaixizawaluduo Sorry, I don't have an L40, so I'm unable to run specific tests or analyze the performance directly. However, there might be a few possible reasons: the current MoE-EP Triton implementation has not been fully tuned or configured with the best settings; additionally, expert parallelism may face load-balancing issues when concurrency is low.
Thanks!
Motivation
This is the implementation of MoE Expert Parallel, seamlessly integrated with TP. The expert parallel size (ep-size) matches the tensor parallel size (tp-size).
It supports the following formats:
I used CUDA/CUTLASS to implement these operators. However, integrating CUDA/CUTLASS into SGLang is difficult. Therefore, I rewrote the implementation using Triton. Overall, the Triton implementation performs well, except for Grouped-GEMM, where its performance is significantly lower than the CUTLASS implementation (several times slower on H100).
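For readers unfamiliar with Grouped-GEMM, here is a naive PyTorch reference of what the kernel computes (illustrative names; tokens are assumed pre-sorted by expert, with seg_indptr delimiting each expert's rows). The fused Triton/CUTLASS kernels compute the same thing in a single launch:

```python
import torch

def grouped_gemm_reference(a: torch.Tensor, b: torch.Tensor, seg_indptr: torch.Tensor) -> torch.Tensor:
    """a: [total_tokens, K] sorted by expert; b: [num_experts, K, N]; seg_indptr: [num_experts + 1]."""
    c = a.new_empty(a.shape[0], b.shape[-1])
    for e in range(b.shape[0]):
        start, end = int(seg_indptr[e]), int(seg_indptr[e + 1])
        if end > start:  # experts that received no tokens on this rank are skipped
            c[start:end] = a[start:end] @ b[e]
    return c
```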
Modifications
Performance
Note: EP is not yet at its optimal performance. The final performance of EP parallelism will be better than TP.
This PR: TP + DP + EP
Base: TP + DP
Test model: neuralmagic/DeepSeek-Coder-V2-Instruct-FP8
Hardware: 8xH100
Prefill:
- Base: 21004.67 tokens/s
- This PR: 26039.52 tokens/s
Decode:
- Base: 10738.72 tokens/s
- This PR: 10591.17 tokens/s
Future Work
The work on the SGLang side is essentially complete. Moving forward, further performance optimization of Grouped-GEMM is required.
I plan to integrate the related optimizations for Grouped-GEMM (CUTLASS) into FlashInfer, enabling SGLang to access high-performance Grouped-GEMM implementations through FlashInfer. Additionally, further optimization of the Triton Grouped-GEMM implementation is necessary.
Update: 11/27
I did a fine-grained performance analysis of MoE-EP on neuralmagic/DeepSeek-Coder-V2-Instruct-FP8.
One layer performance results
Prefill phase
The table below shows the time of each module in a specific layer during the Prefill phase, measured in milliseconds (ms).
It can be observed that the MoE-EP kernel has achieved approximately a 3x performance improvement compared to TP. However, since the MoE component constitutes only a small portion (with Attention being the major contributor), the overall model performance improvement in testing is limited.
Base (TP) = 13.8 ms
This PR (TP+EP) = 11.2 ms
13.8 / 11.2 = 1.23
26039.52 / 21004.67 = 1.24
It can be seen that these two ratios are essentially consistent.
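For intuition, the layer-level ratio follows from Amdahl's law (a rough back-of-the-envelope using only the numbers above, not an additional measurement):

$$\text{layer speedup} = \frac{1}{(1 - p) + p/s}$$

where $p$ is the fraction of layer time spent in MoE and $s$ is the per-kernel speedup. Plugging in $s \approx 3$ and a layer speedup of $\approx 1.23$ gives $p \approx 0.28$, i.e. MoE accounts for roughly 28% of the layer time, consistent with attention being the dominant cost.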
Base(TP)-Nsys
ThisPR(TP+EP)-Nsys
Decode phase
The decode phase was also analyzed.
Base (TP) = 4.3 ms
This PR (TP+EP) ≈ 4.3 ms
From the kernel perspective, there is also an improvement in the MoE section during the decode phase. However, due to MoE expert load imbalance, the computation time varies across GPUs, and the slowest GPU determines the overall latency. As a result, the end-to-end time difference is minimal, which aligns with the E2E test results.
Additionally, even if the load is well-balanced, the MoE component still makes up a small portion of the model (similar to the prefill phase, where Attention dominates). Even with further optimization of MoE, the overall improvement would still be minimal.
Base(TP)-Nsys
ThisPR(TP+EP)-Nsys
In summary, the current MoE-EP kernel shows a 1-3x improvement over MoE-TP. However, since the MoE component makes up a small portion of DeepSeek-V2, the overall improvement in the E2E test is not significant. (Of course, the MoE-EP Triton implementation can still be further improved.)
Additionally, based on the nsys results, the current MLA implementation can still be further improved. I will submit my implementation for it in the future.
Checklist