How are expert parameters distributed in the cluster when using the Tutel framework? #251

Open
luuck opened this issue Oct 30, 2024 · 1 comment

luuck commented Oct 30, 2024

Sorry, I have a few questions:
1. If I set num_local_experts = 2, does that mean every GPU holds two experts, and that both experts' parameters live on that one GPU?
2. If I set num_local_experts = -2, does that mean two GPUs share one expert? If so, how are that expert's parameters distributed across the two GPUs?
3. When I use data parallelism with Tutel, can a training process on one GPU only use the experts placed on that GPU? Is cross-node communication through the MoE layer possible?
4. When I use pipeline parallelism with Tutel, it would be best to place experts on specific GPUs to reduce communication. Can I choose which GPU each expert is placed on myself?

@ghostplant (Contributor) commented

Here are the answers:

  1. Yes.
  2. Each of the two GPUs stores half of that expert's parameters. For example, with 4 GPUs maintaining 2 experts A and B, the per-GPU parameter distribution is: 1/2 of A, 1/2 of A, 1/2 of B, 1/2 of B.
  3. Can you clarify this question? The MoE layer already includes cross-node communication, so tokens are not restricted to the experts stored on their local GPU.
  4. Different MoE groups can be placed on specific GPUs by passing a custom process group when creating moe_layer() (see the sketch below). However, within a single MoE group, expert placement is specially designed, and changing it would break the distributed algorithm inside it.
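
For reference, here is a minimal sketch (not from this thread) of how the settings above map onto tutel.moe.moe_layer(), assuming the helloworld-style API where num_local_experts is passed through as experts['count_per_node'] and the custom process group from answer 4 is supplied via the group keyword. Argument names, the negative-count semantics, and the sizes shown are assumptions to verify against your Tutel version.

```python
# Hedged sketch: launch with torchrun on >= 4 GPUs; names follow Tutel's
# helloworld example and may need adjusting for your Tutel version.
import torch
import torch.distributed as dist
from tutel import moe as tutel_moe

dist.init_process_group(backend='nccl')
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model_dim, hidden_size = 1024, 4096

moe = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=model_dim,
    experts={
        'type': 'ffn',
        # Answer 1: count_per_node = 2 keeps two whole experts on each GPU.
        # Answer 2: count_per_node = -2 slices each expert across 2 GPUs instead.
        'count_per_node': 2,
        'hidden_size_per_expert': hidden_size,
        'activation_fn': lambda x: torch.nn.functional.relu(x),
    },
    # Under data parallelism, Tutel's examples exclude expert parameters from
    # the DDP all-reduce by marking them with a skip_allreduce attribute.
    scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
    # Answer 4: restrict this MoE group to a subset of ranks via a custom
    # process group (hypothetical subset of ranks 0-3; every rank must still
    # call new_group, and only ranks in the group should use this layer).
    group=dist.new_group(ranks=[0, 1, 2, 3]),
).cuda()

# Tokens are routed (with cross-node all-to-all, per answer 3) among the
# experts owned by this MoE group.
y = moe(torch.randn(8, model_dim, device='cuda'))
```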
