Sorry, I have a few questions:
1. If I set num_local_experts = 2, does that mean every GPU has two experts, with both experts' parameters stored on that one GPU?
2. If I set num_local_experts = -2, does that mean two GPUs share one expert? How are that expert's parameters distributed across the two GPUs?
3. When I use data parallelism with Tutel, can a training process on a given GPU only use the experts placed on that GPU? Is cross-node communication possible through the MoE layer?
4. When I use pipeline parallelism with Tutel, it is best to place experts on specific GPUs to reduce communication. Can I choose which GPU each expert is placed on myself?
Each of the two GPUs will store half of that expert's parameters. For example, with 4 GPUs maintaining 2 experts A and B, the parameter distribution across the 4 GPUs will be: 1/2 of A, 1/2 of A, 1/2 of B, 1/2 of B.
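A minimal sketch of the configuration being described, assuming the helloworld-style `tutel_moe.moe_layer()` API where the local expert count is passed as `count_per_node` and a negative value requests expert sharding; the sizes below are placeholders, not values from this thread:

```python
import torch.nn.functional as F
from tutel import moe as tutel_moe

# Placeholder sizes for illustration only.
model_dim, hidden_size = 2048, 2048

# count_per_node = -2 requests that each expert be sharded across 2 GPUs,
# so with 4 GPUs there are 2 experts (A, B), and the per-GPU parameter
# distribution is: [1/2 A, 1/2 A, 1/2 B, 1/2 B].
moe = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=model_dim,
    experts={
        'type': 'ffn',
        'count_per_node': -2,
        'hidden_size_per_expert': hidden_size,
        'activation_fn': lambda x: F.relu(x),
    },
)
```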
Can you explain this question more clearly? The MoE layer already performs cross-node communication internally.
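As a sketch of how data parallelism and the MoE layer coexist, loosely following the Tutel helloworld_ddp pattern (the `skip_allreduce` tagging and parameter split below are assumptions based on that example): expert parameters are excluded from the data-parallel all-reduce, while the MoE layer's internal all-to-all moves tokens to experts on other GPUs or nodes.

```python
from tutel import moe as tutel_moe

# Hypothetical sizes; the key point is scan_expert_func tagging expert
# parameters so the data-parallel wrapper can skip them.
moe = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=2048,
    experts={'type': 'ffn', 'count_per_node': 2, 'hidden_size_per_expert': 2048},
    scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
)

# Under data parallelism, only parameters WITHOUT skip_allreduce are all-reduced;
# expert parameters stay on their GPU, and the MoE layer's internal all-to-all
# performs the cross-GPU / cross-node token exchange.
shared_params = [p for p in moe.parameters() if not hasattr(p, 'skip_allreduce')]
expert_params = [p for p in moe.parameters() if hasattr(p, 'skip_allreduce')]
```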
Different MoE groups can be placed on specific GPUs by passing a custom process group when creating moe_layer(). However, within a single MoE group, expert placement is specially designed, and changing it would break the distributed algorithm inside it.
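A sketch of placing different MoE groups on specific GPUs via a custom process group, assuming moe_layer() accepts a `group` argument as in the Tutel examples; the two-groups-of-four split is illustrative only:

```python
import torch.distributed as dist
from tutel import moe as tutel_moe

# Every rank must create both groups, then use the one it belongs to.
# Illustrative split: ranks 0-3 form one MoE group, ranks 4-7 another.
group_a = dist.new_group(ranks=list(range(0, 4)))
group_b = dist.new_group(ranks=list(range(4, 8)))
my_moe_group = group_a if dist.get_rank() < 4 else group_b

# All MoE communication (all-to-all dispatch/combine) stays inside my_moe_group;
# expert placement within the group is still decided by Tutel.
moe = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=2048,
    experts={'type': 'ffn', 'count_per_node': 2, 'hidden_size_per_expert': 2048},
    group=my_moe_group,
)
```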