Sorry, I have a few questions:
1. If I set num_local_experts = 2, does that mean every GPU has two experts, with both experts' parameters stored on that one GPU?
2. If I set num_local_experts = -2, does that mean two GPUs share one expert? How are that expert's parameters distributed across the two GPUs?
3. When I use data parallelism with Tutel, can a training process on a given GPU only use the experts placed on that GPU? Is cross-node communication possible through the MoE layer?
4. When I use pipeline parallelism with Tutel, it is best to place experts on specific GPUs to reduce communication. Can I choose which GPU each expert is placed on myself?
Each of the two GPUs will store half of that expert's parameters. For example, with 4 GPUs maintaining 2 experts A and B, the parameter distribution across the 4 GPUs will be: 1/2 of A, 1/2 of A, 1/2 of B, 1/2 of B.
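A minimal sketch of the configuration being described, assuming the helloworld-style `tutel_moe.moe_layer()` API where the local expert count is passed as `count_per_node` and a negative value requests expert sharding; the sizes below are placeholders, not values from this thread:

```python
import torch.nn.functional as F
from tutel import moe as tutel_moe

# Placeholder sizes for illustration only.
model_dim, hidden_size = 2048, 2048

# count_per_node = -2 requests that each expert be sharded across 2 GPUs,
# so with 4 GPUs there are 2 experts (A, B), and the per-GPU parameter
# distribution is: [1/2 A, 1/2 A, 1/2 B, 1/2 B].
moe = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=model_dim,
    experts={
        'type': 'ffn',
        'count_per_node': -2,
        'hidden_size_per_expert': hidden_size,
        'activation_fn': lambda x: F.relu(x),
    },
)
```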
Can you explain this question more clearly? The MoE layer already performs cross-node communication internally.
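As a sketch of how data parallelism and the MoE layer coexist, loosely following the Tutel helloworld_ddp pattern (the `skip_allreduce` tagging and parameter split below are assumptions based on that example): expert parameters are excluded from the data-parallel all-reduce, while the MoE layer's internal all-to-all moves tokens to experts on other GPUs or nodes.

```python
from tutel import moe as tutel_moe

# Hypothetical sizes; the key point is scan_expert_func tagging expert
# parameters so the data-parallel wrapper can skip them.
moe = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=2048,
    experts={'type': 'ffn', 'count_per_node': 2, 'hidden_size_per_expert': 2048},
    scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
)

# Under data parallelism, only parameters WITHOUT skip_allreduce are all-reduced;
# expert parameters stay on their GPU, and the MoE layer's internal all-to-all
# performs the cross-GPU / cross-node token exchange.
shared_params = [p for p in moe.parameters() if not hasattr(p, 'skip_allreduce')]
expert_params = [p for p in moe.parameters() if hasattr(p, 'skip_allreduce')]
```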
Different MoE groups can be placed on specific GPUs by passing a custom process group when creating moe_layer(). However, within a single MoE group, expert placement is specially designed, and changing it would break the distributed algorithm inside it.
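A sketch of placing different MoE groups on specific GPUs via a custom process group, assuming moe_layer() accepts a `group` argument as in the Tutel examples; the two-groups-of-four split is illustrative only:

```python
import torch.distributed as dist
from tutel import moe as tutel_moe

# Every rank must create both groups, then use the one it belongs to.
# Illustrative split: ranks 0-3 form one MoE group, ranks 4-7 another.
group_a = dist.new_group(ranks=list(range(0, 4)))
group_b = dist.new_group(ranks=list(range(4, 8)))
my_moe_group = group_a if dist.get_rank() < 4 else group_b

# All MoE communication (all-to-all dispatch/combine) stays inside my_moe_group;
# expert placement within the group is still decided by Tutel.
moe = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=2048,
    experts={'type': 'ffn', 'count_per_node': 2, 'hidden_size_per_expert': 2048},
    group=my_moe_group,
)
```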