Enabled high-performance Automatic Tensor Parallelism (auto TP) for the Qwen2-MoE and DeepSeek-V2 models on multiple GPUs/HPUs #6964
base: master
Conversation
…MoE and DeepSeek-V2 models.
Hi @gyou2021, there is another PR for DeepSeek autotp at this link: #6937. What is the relationship of your PR to that previous PR in terms of functionality? @Yejing-Lai can you take a look at this PR?
The difference lies in how the results of the weighted sum of the routed experts per MoE layer are gathered. In my understanding, in #6937 each selected routed expert per layer was gathered individually, so the gathering time was proportional to the number of selected routed experts per layer. In this PR, the results are gathered once per layer, regardless of the number of selected routed experts.
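For illustration only, here is a minimal sketch of the two gathering strategies described above (not code from either PR). The names expert_outputs and routing_weights are hypothetical: a list of each selected expert's partial output on this rank and the matching routing weights, with deepspeed.comm assumed to be initialized.

import torch
from deepspeed import comm as dist

def combine_per_expert(expert_outputs, routing_weights):
    # Per-expert gathering: one all_reduce per selected routed expert,
    # so communication grows with the number of selected experts.
    final_hidden_states = torch.zeros_like(expert_outputs[0])
    for expert_out, weight in zip(expert_outputs, routing_weights):
        dist.all_reduce(expert_out, op=dist.ReduceOp.SUM)
        final_hidden_states += weight * expert_out
    return final_hidden_states

def combine_once_per_layer(expert_outputs, routing_weights):
    # This PR's approach: compute the weighted sum of the partial expert
    # outputs first, then gather it with a single all_reduce per MoE layer.
    final_hidden_states = torch.zeros_like(expert_outputs[0])
    for expert_out, weight in zip(expert_outputs, routing_weights):
        final_hidden_states += weight * expert_out
    dist.all_reduce(final_hidden_states, op=dist.ReduceOp.SUM)
    return final_hidden_states

Because all_reduce with SUM is linear, both paths produce the same final_hidden_states; the second simply replaces top-k all_reduce calls with one.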
Hi @gyou2021, thanks for this optimization. As we reviewed internally, we need to figure out a way to turn this optimization on through knobs in the model's config.json file, so that it won't break workloads that do not use it. Plus: I think it would be even more user friendly if the coalesced allreduce could be injected by AutoTP, but as I reviewed with @gyou2021, this seems tricky because no proper injection point can be captured when processing the nn.Module. @tohtana @loadams @tjruwase @Yejing-Lai if you have an idea we can discuss in this thread.
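As a rough illustration of the kind of knob being discussed, a hypothetical sketch follows; the flag name use_fused_moe_allreduce is invented here and not an agreed-upon option.

import json

# Hypothetical knob read from the model's config.json; the name is illustrative only.
with open("config.json") as f:
    model_config = json.load(f)

# Default to False so workloads that do not use this optimization keep the existing path.
use_fused_moe_allreduce = model_config.get("use_fused_moe_allreduce", False)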
Reduced the number of AllReduce operations for the routed experts per MoE layer to ONE for the Qwen2-MoE and DeepSeek-V2 models. The results of all selected routed experts per layer across the GPU/HPU cards are gathered ONCE with a single AllReduce operation, instead of gathering each selected routed expert individually (i.e., once per selected routed expert). This change greatly improves performance.
In addition to modifying auto_tp.py, the following files should be updated: modeling_qwen2_moe.py and modeling_deepseek_v2.py. Add the following code after the weighted sum of the outputs of the selected experts per MoE layer:
# Note: is_deepspeed_available should be importable at the top of the modeling file,
# e.g. from transformers.integrations import is_deepspeed_available
if is_deepspeed_available():
    from deepspeed import comm as dist
    if dist.is_initialized():
        # One all_reduce over the weighted sum of the selected experts' outputs per MoE layer.
        dist.all_reduce(final_hidden_states, op=dist.ReduceOp.SUM)
Note: final_hidden_states is the result of the weighted sum of the outputs of the selected experts per MoE layer.
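For context, here is a simplified, hypothetical sketch of where that snippet sits in a Qwen2-MoE-style sparse MoE forward. The expert loop and tensor shapes are only indicative of the real modeling code (hidden_states: [tokens, hidden], routing_weights: [tokens, num_experts], selected_experts: [tokens, top_k]).

import torch
from transformers.integrations import is_deepspeed_available

def moe_forward(hidden_states, experts, routing_weights, selected_experts):
    # Simplified stand-in for the per-layer sparse MoE computation: each selected
    # expert processes its tokens, and the results are combined into
    # final_hidden_states via the routing weights.
    final_hidden_states = torch.zeros_like(hidden_states)
    for expert_idx, expert in enumerate(experts):
        token_mask = (selected_experts == expert_idx).any(dim=-1)
        if token_mask.any():
            expert_out = expert(hidden_states[token_mask])
            weights = routing_weights[token_mask, expert_idx].unsqueeze(-1)
            final_hidden_states[token_mask] += weights * expert_out

    # The addition described in this PR: a single all_reduce per MoE layer over
    # the weighted sum, instead of one per selected routed expert.
    if is_deepspeed_available():
        from deepspeed import comm as dist
        if dist.is_initialized():
            dist.all_reduce(final_hidden_states, op=dist.ReduceOp.SUM)

    return final_hidden_states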