
MoE Expert Parallel Impl #2203

Merged: 12 commits merged into sgl-project:main on Dec 5, 2024

Conversation

xiaobochen123 (Contributor) commented Nov 26, 2024:

Motivation

This is the implementation of MoE Expert Parallel, seamlessly integrated with TP. The expert parallel size (ep-size) matches the tensor parallel size (tp-size).
It supports the following formats:

  • FP16/BF16
  • FP8 (dynamic and static quantization)
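
For reference, a minimal sketch of how experts can be partitioned across ranks when ep-size equals tp-size (the helper and argument names are illustrative, not this PR's actual API):

import torch  # only for parity with the other sketches below; not needed here

# Illustrative only: how experts could map to ranks when ep_size == tp_size.
def local_expert_range(num_experts: int, ep_rank: int, ep_size: int):
    # Each EP rank owns a contiguous slice of experts; assumes num_experts % ep_size == 0.
    experts_per_rank = num_experts // ep_size
    start = ep_rank * experts_per_rank
    return start, start + experts_per_rank

# e.g. a model with 160 routed experts on ep_size == tp_size == 8 gives 20 experts per GPU
print(local_expert_range(160, ep_rank=3, ep_size=8))  # (60, 80)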

I originally implemented these operators in CUDA/CUTLASS. However, integrating CUDA/CUTLASS into SGLang is difficult, so I rewrote the implementation in Triton. Overall, the Triton implementation performs well, except for grouped GEMM, where it is significantly slower than the CUTLASS implementation (several times slower on H100).
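
For context, grouped GEMM here means one independent matmul per expert over the tokens routed to that expert. Below is a naive PyTorch reference of that computation (illustrative names and shapes; this is the math the kernel implements, not the Triton or CUTLASS kernel itself):

import torch

# Naive reference for grouped GEMM: one matmul per expert over its token slice.
# Assumes tokens are pre-sorted by expert, with seg_ptr[e]:seg_ptr[e+1] indexing expert e's rows.
def grouped_gemm_reference(x_sorted: torch.Tensor,   # [num_tokens, hidden]
                           w: torch.Tensor,          # [num_experts, hidden, inter]
                           seg_ptr: torch.Tensor):   # [num_experts + 1]
    out = x_sorted.new_empty(x_sorted.shape[0], w.shape[-1])
    for e in range(w.shape[0]):
        s, t = int(seg_ptr[e]), int(seg_ptr[e + 1])
        if t > s:
            out[s:t] = x_sorted[s:t] @ w[e]
    return out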

Modifications

  • Add the MoE EP layer and kernels.
  • Add a unit test for MoE-EP.
  • Add the server arg --enable-ep-moe; pass this flag to enable MoE expert parallelism.

Performance

Note: EP is not yet at its optimal performance; once fully optimized, EP should outperform TP.

  • This PR: TP + DP + EP

  • Base: TP + DP

  • Test model: neuralmagic/DeepSeek-Coder-V2-Instruct-FP8

  • Hardware: 8x H100

  • Prefill:
    • Base: 21004.67 tokens/s
    • This PR: 26039.52 tokens/s

  • Decode:
    • Base: 10738.72 tokens/s
    • This PR: 10591.17 tokens/s

# Base:
python3 -m sglang.launch_server             \
    --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 \
    --trust-remote-code                     \
    --disable-radix-cache                   \
    --quantization "fp8"                    \
    --kv-cache-dtype "fp8_e5m2"             \
    --mem-fraction-static 0.7              \
    --max-running-requests 4096             \
    --max-prefill-tokens  16384             \
    --chunked-prefill-size 16384            \
    --tp 8                                  \
    --dp 8 --enable-dp-attention

# This PR
python3 -m sglang.launch_server             \
    --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 \
    --trust-remote-code                     \
    --disable-radix-cache                   \
    --quantization "fp8"                    \
    --kv-cache-dtype "fp8_e5m2"             \
    --mem-fraction-static 0.7              \
    --max-running-requests 4096             \
    --max-prefill-tokens  16384             \
    --chunked-prefill-size 16384            \
    --tp 8                                  \
    --dp 8 --enable-dp-attention            \
    --enable-ep-moe

# Test
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 512 --random-output 1 --random-range-ratio 1 --num-prompts 10000

python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1 --random-output 512 --random-range-ratio 1 --num-prompts 10000

Future Work

The SGLang-side work is essentially complete. Going forward, the main remaining task is performance optimization of grouped GEMM.
I plan to integrate the CUTLASS grouped GEMM optimizations into FlashInfer, so that SGLang can access a high-performance grouped GEMM implementation through FlashInfer. The Triton grouped GEMM implementation also needs further tuning.

Update: 11/27

I did a fine-grained performance analysis of MoE-EP on neuralmagic/DeepSeek-Coder-V2-Instruct-FP8.

One layer performance results

Prefill phase

The screenshot below shows the time of each module in a specific layer during the prefill phase, measured in milliseconds (ms).
[screenshot: per-module prefill timing table]

It can be observed that the MoE-EP kernel has achieved approximately a 3x performance improvement compared to TP. However, since the MoE component constitutes only a small portion (with Attention being the major contributor), the overall model performance improvement in testing is limited.

Base (TP) = 13.8 ms
This PR (TP+EP) = 11.2 ms
13.8 / 11.2 = 1.23
26039.52 / 21004.67 = 1.24
These two ratios are essentially consistent.
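
Back-of-the-envelope check, derived from the numbers above and assuming a ~3x MoE kernel speedup: the layer saves 13.8 - 11.2 = 2.6 ms, and t - t/3 = 2.6 ms gives t ≈ 3.9 ms for the TP-side MoE time, i.e. roughly 28% of the 13.8 ms layer. The remaining ~72% (attention and other kernels) is untouched, which is why the end-to-end gain stays near 1.2x.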

  • Base (TP) Nsight Systems trace: [screenshot]

  • This PR (TP+EP) Nsight Systems trace: [screenshot]

Decode phase

The decode phase was also analyzed.
[screenshot: per-module decode timing table]
Base (TP) = 4.3 ms
This PR (TP+EP) ≈ 4.3 ms

From the kernel perspective, the MoE section also improves during the decode phase. However, due to expert load imbalance, the computation time varies across GPUs (the slowest GPU sets the lower bound). As a result, the time difference is minimal, which aligns with the E2E test results.

Additionally, even if the load is well-balanced, the MoE component still makes up a small portion of the model (similar to the prefill phase, where Attention dominates). Even with further optimization of MoE, the overall improvement would still be minimal.

  • Base (TP) Nsight Systems trace: [screenshot]

  • This PR (TP+EP) Nsight Systems trace: [screenshot]

In summary, the current MoE-EP kernel shows a 1-3x improvement over MoE-TP. However, since the MoE component makes up a small portion of DeepSeek-V2, the overall improvement in the E2E test is not significant. (Of course, the MoE-EP Triton implementation can still be further improved.)

Additionally, based on the nsys results, the current MLA implementation can still be further improved. I will submit my implementation for it in the future.

Checklist

  • [x] Format your code according to the Contributor Guide.
  • [x] Add unit tests as outlined in the Contributor Guide.
  • [ ] Update documentation as needed, including docstrings or example tutorials.

zhyncs (Member) commented Nov 26, 2024:

It's amazing! Please join our slack channel https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2rtikx2pv-DUfPrhx2SaNAq~47YtV1XQ

I used CUDA/CUTLASS to implement these operators. However, integrating CUDA/CUTLASS into SGLang is difficult. Therefore, I rewrote the implementation using Triton.

We currently have a kernel library, and I can add you to the team for collaboration.

xiaobochen123 (Contributor Author) replied:

I'm in the slack channel.

xiaobochen123 (Contributor Author):

Not sure why one of the unit tests failed. It seems unrelated to this PR.

xiaobochen123 (Contributor Author):

I have updated the latest performance analysis results; see "Update: 11/27" above. @zhyncs @ispobock

HaiShaw self-requested a review on November 28, 2024.
fengyang95:

@xiaobochen123 Does this rely on dp-attention? Also, from what I understand, EP shouldn't add any extra memory overhead, right?

HaiShaw (Collaborator) commented Nov 28, 2024:

@xiaobochen123 Could you highlight how TP is used along with EP within the MoE layer here?
Also, do we account for traffic (expert load) imbalance in the benchmarking, and if so, how? Thanks!

austin362667 (Contributor):

Amazing work, @xiaobochen123! This is super dope!

xiaobochen123 (Contributor Author):

@fengyang95 1. It does not rely on DP-attention. 2. It does not add extra memory.

xiaobochen123 (Contributor Author):

@HaiShaw When EP is enabled, the MoE module uses EP with no TP (EP size = TP size), while attention still uses TP. Regarding the imbalance issue, you can analyze it with nsys: if the MoE execution time differs across GPUs, the imbalance shows up there. For a more precise analysis, you can collect the top-k routing information for each layer.
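
For illustration, a minimal sketch of the kind of top-k bookkeeping this refers to, assuming the router's top-k expert indices for a layer can be captured (function and variable names are hypothetical):

import torch

# Sketch only: estimate per-rank expert load from captured top-k router indices
# for one layer. Assumes experts are split evenly and contiguously across EP ranks.
def per_rank_token_load(topk_ids: torch.Tensor, num_experts: int, ep_size: int):
    counts = torch.bincount(topk_ids.flatten(), minlength=num_experts)
    per_rank = counts.view(ep_size, num_experts // ep_size).sum(dim=1)
    # Imbalance factor: 1.0 means perfectly balanced; the most loaded rank sets the step time.
    imbalance = per_rank.max().item() / per_rank.float().mean().clamp(min=1).item()
    return per_rank, imbalance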

xiaobochen123 force-pushed the dev/moe_ep branch 2 times, most recently from fe21d38 to 10c405a on November 29, 2024.
ispobock (Collaborator) left a comment:

@xiaobochen123 Thanks for the awesome contribution! Overall LGTM. Left some comments for minor changes.

(Inline review comments on python/sglang/srt/layers/ep_moe/kernels.py and python/sglang/srt/layers/ep_moe/layer.py; resolved.)

Inline review thread on the following lines:

BLOCK_SIZE_M = 128
BLOCK_SIZE_N = 128
BLOCK_SIZE_K = 128

Collaborator: Make them kargs, so tunable?

Reply: There is room for improvement. I am not very familiar with Triton. Is there a reference demo for Triton tuning?

Collaborator: Most kargs as in https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/layers/fused_moe_triton/fused_moe.py can be tuned.
For the tuning, we can refer to https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton

Contributor Author: In the MoE-TP Triton kernel, the best config is selected based on M, where M is hidden_states.size(0), which is a fixed value. In MoE-EP, however, M can vary across layers and experts, which may make it hard to identify a configuration that works well for all cases. This is purely about further performance optimization and doesn't affect the current functionality.
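
For illustration, one possible heuristic is a nearest-M lookup, similar in spirit to how the fused MoE Triton path selects tuned configs; this is a sketch only, and the config table below is hypothetical, not SGLang's tuned values:

# Sketch only: pick a block config by nearest benchmarked M (hypothetical table).
TUNED_CONFIGS = {
    16:   {"BLOCK_SIZE_M": 16,  "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 128},
    256:  {"BLOCK_SIZE_M": 64,  "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128},
    4096: {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128},
}

def pick_config(m: int) -> dict:
    # In EP, the per-expert M varies across layers and experts, so a nearest-M
    # lookup is only a heuristic, unlike the fixed-M case in the TP path.
    return TUNED_CONFIGS[min(TUNED_CONFIGS, key=lambda k: abs(k - m))]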

HaiShaw (Collaborator) left a comment:

Also - can we add use_fp8_w8a8 etc. key flags as kargs? Thanks!

HaiShaw (Collaborator) left a comment:

@xiaobochen123, very nice work, and I look forward to more contributions from you!
Several general comments:

  • Would you please add some more comments, e.g. to new functions and/or kernels?
  • Can we follow similar/closer naming conventions (variables, functions, kernels) to the existing fused_moe.py?
  • What assumptions do we make about quantization (FP8, etc.) for EP support? For example, should a TP-quantized model work seamlessly with EP?
  • --tp 8 --enable-ep-moe seems confusing, since TP and EP don't coexist in the MoE layer; is it possible to introduce --ep <NUM> or something similar?

merrymercy (Contributor) left a comment:

Very nice work! Regarding the CUTLASS integration, we also opened a new folder to start integrating more GPU kernel code here: https://github.com/sgl-project/sglang/tree/main/sgl-kernel.
We plan to gradually add more kernels there, such as custom allreduce and FP8 GEMM. You can also put the CUTLASS grouped GEMM there. It will be a separate Python package, but we host it in the sglang mono repo so we can share CI resources and avoid syncing between two repos.

This is a very good contribution, so we would like to merge the Triton version as soon as possible. We can then do the CUTLASS integration in follow-up PRs.

(Additional inline review comments on python/sglang/srt/layers/ep_moe/kernels.py and test/srt/test_moe_ep.py; resolved.)
xiaobochen123 (Contributor Author) commented Dec 2, 2024:

@HaiShaw Replying to the general comments above:

  1. On similar/closer naming to fused_moe.py: could you tell me more about what needs fixing?
  2. I think TP-quantized models work seamlessly with EP, because TP quantization computes a scale per expert, which is unaffected by EP.
  3. I have added the --ep-size arg.
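
For illustration, a sketch of why per-expert scales make TP-quantized checkpoints EP-friendly, assuming per-expert FP8 weight scales (names are illustrative):

import torch

# Sketch: because every expert carries its own scale, slicing whole experts across
# EP ranks keeps each (weight, scale) pair intact, so no requantization should be needed.
def slice_expert_weights(w_fp8: torch.Tensor, w_scale: torch.Tensor,
                         ep_rank: int, ep_size: int):
    experts_per_rank = w_fp8.shape[0] // ep_size        # w_fp8: [num_experts, ...]
    s = ep_rank * experts_per_rank                      # w_scale: [num_experts]
    return w_fp8[s:s + experts_per_rank], w_scale[s:s + experts_per_rank]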

HaiShaw (Collaborator) commented Dec 4, 2024:

to merge, thanks.

xiaobochen123 (Contributor Author) commented Dec 4, 2024, quoting HaiShaw:

"@xiaobochen123 nice update - can you give a try to the model below too? amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV
Accuracy results you may ignore for now, just check if it runs to completion."

@HaiShaw I tested the model, and it's OK.

HaiShaw (Collaborator) commented Dec 5, 2024:

@xiaobochen123 Could you update the branch to prepare for merge?

xiaobochen123 (Contributor Author):

@HaiShaw Updated.

HaiShaw (Collaborator) commented Dec 5, 2024:

@xiaobochen123 can you fix this CI error:

  File "/actions-runner/_work/sglang/sglang/python/sglang/srt/models/mixtral.py", line 82, in __init__
    MoEImpl = EPMoE if global_server_args_dict["enable_ep_moe"] else FusedMoE
NameError: name 'global_server_args_dict' is not defined

HaiShaw merged commit f9b7c64 into sgl-project:main on Dec 5, 2024.
15 checks passed.
Mutinifni (Contributor) commented Dec 5, 2024:

This PR is great, thanks @xiaobochen123!

I am running a benchmark with your code without enabling DP attention, and I consistently see lower performance for EP than TP.

Here are the benchmarking scripts:

model=/ssd/models/deepseek-ai/DeepSeek-Coder-V2-Instruct

# TP
python -m sglang.bench_offline_throughput \
    --model-path $model \
    --num-prompts 1000 \
    --trust-remote-code \
    --tp-size 8

# EP
python -m sglang.bench_offline_throughput \
    --model-path $model \
    --num-prompts 1000 \
    --trust-remote-code \
    --tp-size 8 \
    --ep-size 8 \
    --enable-ep-moe

The output token throughputs on DGX-A100 are:
TP: 1027.46 tok/s
EP: 904.29 tok/s

Does this align with your expectation given your above benchmarks? Thanks!

Minor comment: although the --ep-size argument exists, it currently requires --tp-size to be set. It might be better to not require --tp-size if EP is used or maybe specify attn vs. ffn in the args?

xiaobochen123 (Contributor Author):

@Mutinifni Based on my tests, the output throughput is slightly worse than TP. While ep-size is enabled, tp-size is still required because attention relies on TP. I added ep-size following dp-size, and their behaviors should be quite similar.

ByronHsu pushed a commit that referenced this pull request Dec 6, 2024
sitabulaixizawaluduo:

If I set tp-size = 2, ep-size = 2, and --enable-ep-moe, is it EP only or hybrid parallel now?

xiaobochen123 (Contributor Author):

@sitabulaixizawaluduo It's hybrid parallel: MoE uses EP, attention uses TP.

sitabulaixizawaluduo:

Thanks for the reply. I still have a question: is EP only necessarily faster than TP + EP for the MLP computation?

xiaobochen123 (Contributor Author):

@sitabulaixizawaluduo Sorry, I might not have fully understood your point. Why would TP + EP be used in an MLP? From my understanding, an MLP uses either TP or EP.

sitabulaixizawaluduo:

For example, take a model with 8 experts running on four GPUs with tp_size = 2 and ep_size = 2. In the MLP layer, GPU0 and GPU1 hold 4 of the experts, and those 4 experts are computed in parallel with TP.

xiaobochen123 (Contributor Author):

@sitabulaixizawaluduo TP requires one communication, and EP adds another. Could this result in degraded performance?

sitabulaixizawaluduo:

After TP, the amount of MLP computation on each GPU is reduced. In the worst case (the selected experts are deployed on different devices) the traffic is the same as with TP, but if the selected experts are on the same device, communication is faster, since GPU memory bandwidth > NVLink >> PCIe.

xiaobochen123 (Contributor Author):

@sitabulaixizawaluduo From my understanding, the overall computation should be the same for TP=4, EP=4, and TP=2/EP=2. It's unlikely that TP=2/EP=2 would have less computation. In my current implementation, MoE-EP uses AllReduce, so the communication cost is the same as TP.
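
A rough way to see the compute side of this claim (a sketch assuming perfectly balanced routing; real EP load can be skewed):

# Rough sanity check: per-GPU MoE GEMM work under different parallel layouts.
def per_gpu_work(total_expert_flops: float, tp: int, ep: int) -> float:
    # TP splits each expert's weights tp ways; EP splits the experts ep ways.
    # Either way the total work is divided across tp * ep GPUs.
    return total_expert_flops / (tp * ep)

F = 1.0  # normalized total MoE FLOPs for one layer
print(per_gpu_work(F, tp=4, ep=1), per_gpu_work(F, tp=1, ep=4), per_gpu_work(F, tp=2, ep=2))
# -> 0.25 0.25 0.25: per-GPU compute matches; differences come from kernel
#    efficiency, communication, and load imbalance.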

sitabulaixizawaluduo:

Thanks! I will recalculate the forward communication and computation for this part. By the way, when I tested the response time of a single request (input length 1000-2000) on an L40 (PCIe), with TP deployed as --tp-size 2 and EP deployed as --tp-size 2 --enable-ep-moe, I found that TP is faster than EP. What could be the reason for the performance gap between the two?

xiaobochen123 (Contributor Author):

@sitabulaixizawaluduo Sorry, I don't have an L40, so I'm unable to run specific tests or analyze the performance directly. However, there are a few possible reasons: the current Triton MoE-EP implementation has not been fully tuned or configured with the best settings, and expert parallelism may face load-balancing issues when concurrency is low.

sitabulaixizawaluduo:

Thanks!

Labels: enhancement (new feature or request), high priority
10 participants