Enabled high-performance Automatic Tensor Parallelism (auto TP) for the Qwen2-MoE and DeepSeek-V2 models on multiple GPUs/HPUs #6964
base: master
Conversation
…MoE and DeepSeek-V2 models.
Hi @gyou2021, there is another PR for DeepSeek autotp at this link: #6937. What is the relationship of your PR to that previous PR in terms of functionality? @Yejing-Lai can you take a look at this PR?
The difference lies in how the results of the weighted sum of the routed experts per MoE layer are gathered. In my understanding, in #6937 each selected routed expert per layer was gathered individually, so the gathering time was proportional to the number of selected routed experts per layer. In this PR, the results are gathered once per layer, regardless of the number of selected routed experts.
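For illustration only, here is a minimal sketch of the two gathering strategies described above (not code from either PR). The names expert_outputs and routing_weights are hypothetical: a list of each selected expert's partial output on this rank and the matching routing weights, with deepspeed.comm assumed to be initialized.

import torch
from deepspeed import comm as dist

def combine_per_expert(expert_outputs, routing_weights):
    # Per-expert gathering: one all_reduce per selected routed expert,
    # so communication grows with the number of selected experts.
    final_hidden_states = torch.zeros_like(expert_outputs[0])
    for expert_out, weight in zip(expert_outputs, routing_weights):
        dist.all_reduce(expert_out, op=dist.ReduceOp.SUM)
        final_hidden_states += weight * expert_out
    return final_hidden_states

def combine_once_per_layer(expert_outputs, routing_weights):
    # This PR's approach: compute the weighted sum of the partial expert
    # outputs first, then gather it with a single all_reduce per MoE layer.
    final_hidden_states = torch.zeros_like(expert_outputs[0])
    for expert_out, weight in zip(expert_outputs, routing_weights):
        final_hidden_states += weight * expert_out
    dist.all_reduce(final_hidden_states, op=dist.ReduceOp.SUM)
    return final_hidden_states

Because all_reduce with SUM is linear, both paths produce the same final_hidden_states; the second simply replaces top-k all_reduce calls with one.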
Hi @gyou2021, thanks for this optimization. As we reviewed internally, we need to figure out a way to turn this optimization on through knobs in the model's config.json file, so that it won't break workloads that do not use it. Plus: I think it would be even more user friendly if the coalesced allreduce could be injected by AutoTP, but as I reviewed with @gyou2021, this seems tricky because no proper injection point can be captured when processing the nn.Module. @tohtana @loadams @tjruwase @Yejing-Lai if you have an idea we can discuss in this thread.
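As a rough illustration of the kind of knob being discussed, a hypothetical sketch follows; the flag name use_fused_moe_allreduce is invented here and not an agreed-upon option.

import json

# Hypothetical knob read from the model's config.json; the name is illustrative only.
with open("config.json") as f:
    model_config = json.load(f)

# Default to False so workloads that do not use this optimization keep the existing path.
use_fused_moe_allreduce = model_config.get("use_fused_moe_allreduce", False)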
Reduced the number of AllReduce operations for the routed experts per MoE layer to ONE for the Qwen2-MoE and DeepSeek-V2 models. The results of all selected routed experts per layer across the GPU/HPU cards are gathered ONCE with a single AllReduce operation, instead of gathering each selected routed expert individually (i.e., once per selected routed expert). This change greatly improves performance.
In addition to modifying auto_tp.py, the following files should be updated: modeling_qwen2_moe.py and modeling_deepseek_v2.py. Add the following code after the weighted sum of the outputs of the selected experts per MoE layer:
# Note: is_deepspeed_available should be importable at the top of the modeling file,
# e.g. from transformers.integrations import is_deepspeed_available
if is_deepspeed_available():
    from deepspeed import comm as dist
    if dist.is_initialized():
        # One all_reduce over the weighted sum of the selected experts' outputs per MoE layer.
        dist.all_reduce(final_hidden_states, op=dist.ReduceOp.SUM)
Note: final_hidden_states is the result of the weighted sum of the outputs of the selected experts per MoE layer.
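For context, here is a simplified, hypothetical sketch of where that snippet sits in a Qwen2-MoE-style sparse MoE forward. The expert loop and tensor shapes are only indicative of the real modeling code (hidden_states: [tokens, hidden], routing_weights: [tokens, num_experts], selected_experts: [tokens, top_k]).

import torch
from transformers.integrations import is_deepspeed_available

def moe_forward(hidden_states, experts, routing_weights, selected_experts):
    # Simplified stand-in for the per-layer sparse MoE computation: each selected
    # expert processes its tokens, and the results are combined into
    # final_hidden_states via the routing weights.
    final_hidden_states = torch.zeros_like(hidden_states)
    for expert_idx, expert in enumerate(experts):
        token_mask = (selected_experts == expert_idx).any(dim=-1)
        if token_mask.any():
            expert_out = expert(hidden_states[token_mask])
            weights = routing_weights[token_mask, expert_idx].unsqueeze(-1)
            final_hidden_states[token_mask] += weights * expert_out

    # The addition described in this PR: a single all_reduce per MoE layer over
    # the weighted sum, instead of one per selected routed expert.
    if is_deepspeed_available():
        from deepspeed import comm as dist
        if dist.is_initialized():
            dist.all_reduce(final_hidden_states, op=dist.ReduceOp.SUM)

    return final_hidden_states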