MoE Expert Parallel Impl #2203
Conversation
It's amazing! Please join our Slack channel https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2rtikx2pv-DUfPrhx2SaNAq~47YtV1XQ
We currently have a kernel library, and I can add you to the team for collaboration.
I'm in the Slack channel.
Not sure why one of the unit tests failed. It seems unrelated to this PR.
@xiaobochen123 Does this rely on DP attention? Also, from what I understand, EP shouldn't add any extra memory burden, right?
@xiaobochen123 Could you highlight how TP is used along with EP within the MoE layer here?
Amazing work, @xiaobochen123! This is super dope.
@fengyang95 1. It does not rely on DP attention. 2. It does not add extra memory.
@HaiShaw When EP is enabled, the MoE module uses EP with no TP (EP size = TP size), while attention remains TP. Regarding the imbalance issue, you can analyze performance with nsys: if the MoE execution time differs across GPUs, that reveals the imbalance. For more precise analysis, you can collect the top-k routing information for each layer.
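As a rough illustration of what collecting per-layer top-k information could look like (a hypothetical helper, not part of this PR), one can histogram the router's expert assignments and compute a max-over-mean imbalance factor:

```python
# Hypothetical sketch: measure expert load balance from the router's top-k output.
import torch

def expert_load_histogram(topk_ids: torch.Tensor, num_experts: int) -> torch.Tensor:
    """topk_ids: [num_tokens, top_k] expert indices; returns per-expert token-slot counts."""
    return torch.bincount(topk_ids.flatten(), minlength=num_experts)

def imbalance_factor(load: torch.Tensor) -> float:
    """Max-over-mean load; 1.0 means perfectly balanced experts."""
    return (load.max().float() / load.float().mean().clamp(min=1.0)).item()

# Example with random routing over 160 experts and top_k = 6:
topk_ids = torch.randint(0, 160, (4096, 6))
load = expert_load_histogram(topk_ids, num_experts=160)
print(load, imbalance_factor(load))
```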
@xiaobochen123 Thanks for the awesome contribution! Overall LGTM. Left some comments for minor changes.
BLOCK_SIZE_M = 128
BLOCK_SIZE_N = 128
BLOCK_SIZE_K = 128
Can we make them kwargs, so they're tunable?
There is room for improvement. I am not very familiar with Triton. Is there a reference demo for Triton tuning?
Most kwargs, as in https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/layers/fused_moe_triton/fused_moe.py, can be tuned.
For the tuning, we can refer to https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
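For reference, a minimal sketch of what making the block sizes tunable could look like with triton.autotune (illustrative kernel name and configs; the real tuned values would come from the benchmark scripts above):

```python
# Sketch only: expose BLOCK_SIZE_* as constexpr arguments and let Triton pick a config.
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 64}, num_warps=4),
        triton.Config({"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128}, num_warps=8),
    ],
    key=["M", "N", "K"],  # re-select a config when the problem shape changes
)
@triton.jit
def grouped_gemm_kernel(  # illustrative name, not the kernel in this PR
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    BLOCK_SIZE_M: tl.constexpr,
    BLOCK_SIZE_N: tl.constexpr,
    BLOCK_SIZE_K: tl.constexpr,
):
    # ... kernel body unchanged; the block sizes are now compile-time parameters ...
    pass
```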
In the MoE-TP Triton kernel, the best config is selected based on M, where M is hidden_states.size(0), which is a fixed value.
In MoE-EP, however, M can vary across layers or experts, which may make it harder to identify a configuration that works well for all cases.
This primarily concerns further performance optimization and doesn't affect the current functionality.
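One way to handle the varying M (a hypothetical helper, loosely mirroring how the TP fused_moe path looks up pre-tuned configs keyed by M) is to pick the tuned config whose M is closest to the runtime value:

```python
# Hypothetical sketch: nearest-M lookup into an offline-tuned config table.
def get_config_for_m(m: int, tuned_configs: dict) -> dict:
    """tuned_configs maps a benchmarked M -> kernel config (block sizes, num_warps, ...)."""
    nearest_m = min(tuned_configs.keys(), key=lambda key_m: abs(key_m - m))
    return tuned_configs[nearest_m]

# Example: configs tuned offline for a few representative M values.
tuned = {
    1: {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 128, "num_warps": 4},
    64: {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128, "num_warps": 8},
    2048: {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128, "num_warps": 8},
}
print(get_config_for_m(300, tuned))  # picks the config tuned for M=64
```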
Also, can we add use_fp8_w8a8 and other key flags as kwargs? Thanks!
@xiaobochen123, very nice work, and I look forward to more contributions from you!
Several general comments:
- Would you please add some more comments, for example to the new functions and/or kernels?
- Can we follow similar/closer naming conventions (variables, functions, kernels) to the existing fused_moe.py?
- What assumptions do we make about quantization (fp8, etc.) for EP support? E.g., should a TP-quantized model work seamlessly with EP?
- `--tp 8 --enable-ep-moe` seems confusing, as TP/EP don't coexist; is it possible to introduce `--ep <NUM>` or something similar?
Very nice work! Regarding the CUTLASS integration, we also opened a new folder to start integrating more GPU kernel code here: https://github.com/sgl-project/sglang/tree/main/sgl-kernel.
We plan to gradually add more kernels there, like custom allreduce and fp8 GEMM. You can also put the CUTLASS grouped GEMM there. It will be a separate Python package, but we use the sglang mono repo to host it so we can share CI resources and avoid syncing between two repos.
This is a very good contribution, so we would like to merge the Triton one as soon as possible. We can then do the CUTLASS integration in follow-up PRs.
to merge, thanks.
@HaiShaw I tested the model, and it's OK.
@xiaobochen123 Could you update the branch to prepare for the merge?
@HaiShaw Updated.
@xiaobochen123 Can you fix this CI error:
This PR is great, thanks @xiaobochen123! I am running a benchmark with your code without enabling DP attention, and I consistently see lower performance for EP than TP. Here are the benchmarking scripts:

model=/ssd/models/deepseek-ai/DeepSeek-Coder-V2-Instruct
# TP
python -m sglang.bench_offline_throughput \
--model-path $model \
--num-prompts 1000 \
--trust-remote-code \
--tp-size 8
# EP
python -m sglang.bench_offline_throughput \
--model-path $model \
--num-prompts 1000 \
--trust-remote-code \
--tp-size 8 \
--ep-size 8 \
    --enable-ep-moe

The output token throughputs on DGX-A100 are:
Does this align with your expectation given your above benchmarks? Thanks!
Minor comment: although the …
@Mutinifni Based on my tests, EP throughput is slightly lower than TP. While ep-size is enabled, tp-size is still required because attention relies on TP. I added ep-size based on dp-size, so their behaviors should be quite similar.
Co-authored-by: HAI <[email protected]>
If I set tp-size = 2, ep-size = 2, and --enable-ep-moe, is it EP only or hybrid parallelism now?
@sitabulaixizawaluduo It's hybrid parallelism: MoE uses EP, attention uses TP.
Thanks for the reply. I still have a question: is EP-only necessarily faster than TP + EP for the MLP computation?
@sitabulaixizawaluduo Sorry, I might not have fully understood your point. Why would TP + EP be used in an MLP? From my understanding, an MLP uses either TP or EP.
For example, a model with 8 experts runs on four GPUs with tp_size = 2 and ep_size = 2. In the MLP layer, GPU0 and GPU1 hold 4 experts, and those 4 experts do parallel computation with TP.
@sitabulaixizawaluduo TP requires one communication, and EP adds another. Could this result in degraded performance?
After TP, the amount of MLP computation on each GPU is reduced. In the worst case (the selected experts sit on different devices), the traffic is the same as TP; but if the selected experts are on the same device, the communication is faster, since GPU memory bandwidth > NVLink > PCIe.
@sitabulaixizawaluduo From my understanding, the overall computation should be the same for TP=4, EP=4, and TP=2/EP=2. It's unlikely that TP=2/EP=2 would have less computation. In my current implementation, MoE-EP uses AllReduce, so the communication cost is the same as TP.
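To make the communication argument concrete, here is a simplified dense reference of the EP forward path (illustrative code, not the fused Triton kernels in this PR; it assumes torch.distributed is already initialized and each rank holds only its own experts' weights):

```python
import torch
import torch.distributed as dist

def moe_ep_forward(x, topk_ids, topk_weights, local_experts):
    """x: [T, H]; topk_ids/topk_weights: [T, K]; local_experts: {expert_id: (w1, w2)} on this rank."""
    out = torch.zeros_like(x)
    for expert_id, (w1, w2) in local_experts.items():
        # Which tokens (and which of their k routing slots) picked this locally-owned expert.
        token_idx, slot_idx = (topk_ids == expert_id).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        h = torch.nn.functional.silu(x[token_idx] @ w1) @ w2
        out.index_add_(0, token_idx, h * topk_weights[token_idx, slot_idx].unsqueeze(-1))
    # Ranks contribute zeros for tokens whose experts live elsewhere, so one all-reduce
    # completes the sum -- the same collective a TP row-parallel MoE output already needs.
    dist.all_reduce(out)
    return out
```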
Thanks! I will recalculate the amount of forward communication and computation for this part. By the way, when I tested the response time of a single request (input length 1000-2000) on the L40 (PCIe), I found that TP is faster than EP. TP deployment: --tp-size 2; EP deployment: --tp-size 2 --enable-ep-moe. What is the reason for the performance gap between the two?
@sitabulaixizawaluduo Sorry, I don't have an L40, so I'm unable to run specific tests or analyze the performance directly. However, there might be a few possible reasons: the current MoE-EP Triton implementation has not been fully tuned or configured with the best settings; additionally, expert parallelism may face load-balancing issues when concurrency is low.
Thanks!
Motivation
This is the implementation of MoE Expert Parallel, seamlessly integrated with TP. The expert parallel size (ep-size) matches the tensor parallel size (tp-size).
It supports the following formats:
I used CUDA/CUTLASS to implement these operators. However, integrating CUDA/CUTLASS into SGLang is difficult. Therefore, I rewrote the implementation using Triton. Overall, the Triton implementation performs well, except for Grouped-GEMM, where its performance is significantly lower than the CUTLASS implementation (several times slower on H100).
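For readers unfamiliar with Grouped-GEMM, here is a naive PyTorch reference of what the kernel computes (illustrative names; tokens are assumed pre-sorted by expert, with seg_indptr delimiting each expert's rows). The fused Triton/CUTLASS kernels compute the same thing in a single launch:

```python
import torch

def grouped_gemm_reference(a: torch.Tensor, b: torch.Tensor, seg_indptr: torch.Tensor) -> torch.Tensor:
    """a: [total_tokens, K] sorted by expert; b: [num_experts, K, N]; seg_indptr: [num_experts + 1]."""
    c = a.new_empty(a.shape[0], b.shape[-1])
    for e in range(b.shape[0]):
        start, end = int(seg_indptr[e]), int(seg_indptr[e + 1])
        if end > start:  # experts that received no tokens on this rank are skipped
            c[start:end] = a[start:end] @ b[e]
    return c
```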
Modifications
Performance
Note: EP is not yet at its optimal performance. The final performance of EP parallelism will be better than TP.
This PR: TP + DP + EP
Base: TP + DP
Test model: neuralmagic/DeepSeek-Coder-V2-Instruct-FP8
Hardware: 8xH100
Prefill:
- Base: 21004.67 tokens/s
- This PR: 26039.52 tokens/s
Decode:
- Base: 10738.72 tokens/s
- This PR: 10591.17 tokens/s
Future Work
The work on the SGLang side is essentially complete. Moving forward, further performance optimization of Grouped-GEMM is required.
I plan to integrate the related optimizations for Grouped-GEMM (CUTLASS) into FlashInfer, enabling SGLang to access high-performance Grouped-GEMM implementations through FlashInfer. Additionally, further optimization of the Triton Grouped-GEMM implementation is necessary.
Update: 11/27
I did a fine-grained performance analysis of MoE-EP on neuralmagic/DeepSeek-Coder-V2-Instruct-FP8.
One layer performance results
Prefill phase
The table below shows the time of each module in a specific layer during the Prefill phase, measured in milliseconds (ms).
It can be observed that the MoE-EP kernel has achieved approximately a 3x performance improvement compared to TP. However, since the MoE component constitutes only a small portion (with Attention being the major contributor), the overall model performance improvement in testing is limited.
Base (TP) = 13.8 ms
This PR (TP+EP) = 11.2 ms
13.8 / 11.2 = 1.23
26039.52 / 21004.67 = 1.24
It can be seen that these two ratios are essentially consistent.
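For intuition, the layer-level ratio follows from Amdahl's law (a rough back-of-the-envelope using only the numbers above, not an additional measurement):

$$\text{layer speedup} = \frac{1}{(1 - p) + p/s}$$

where $p$ is the fraction of layer time spent in MoE and $s$ is the per-kernel speedup. Plugging in $s \approx 3$ and a layer speedup of $\approx 1.23$ gives $p \approx 0.28$, i.e. MoE accounts for roughly 28% of the layer time, consistent with attention being the dominant cost.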
Base(TP)-Nsys
ThisPR(TP+EP)-Nsys
Decode phase
The decode phase was also analyzed.
Base (TP) = 4.3 ms
This PR (TP+EP) ≈ 4.3 ms
From the kernel perspective, there is also an improvement in the MoE section during the decode phase. However, due to MoE expert load imbalance, the computation time varies across GPUs, and the slowest GPU determines the overall latency. As a result, the end-to-end time difference is minimal, which aligns with the E2E test results.
Additionally, even if the load is well-balanced, the MoE component still makes up a small portion of the model (similar to the prefill phase, where Attention dominates). Even with further optimization of MoE, the overall improvement would still be minimal.
Base(TP)-Nsys
ThisPR(TP+EP)-Nsys
In summary, the current MoE-EP kernel shows a 1-3x improvement over MoE-TP. However, since the MoE component makes up a small portion of DeepSeek-V2, the overall improvement in the E2E test is not significant. (Of course, the MoE-EP Triton implementation can still be further improved.)
Additionally, based on the nsys results, the current MLA implementation can still be further improved. I will submit my implementation for it in the future.
Checklist