[feat] Mixture of Experts #181
Conversation
examples/microGPT.py
Outdated
@@ -71,10 +71,12 @@ def __init__(
                },
            },
            "feedforward_config": {
-               "name": "FusedMLP",  # Use MLP if Triton is not available
+               "name": "MixtureOfExperts",  # Use MLP if Triton is not available
pulling in MoE becomes as simple as that (though distributed training adds another layer of complication)
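For reference, a hedged sketch of what the edited feedforward block could look like once MixtureOfExperts is selected. Only the "name" swap comes from the diff above; the MoE-specific keys shown here (number_of_experts, gate) are illustrative assumptions, not the exact xformers schema.

```python
# Illustrative sketch: the whole MoE switch is a config change.
# "number_of_experts" and "gate" are assumed knobs, not the exact xformers schema.
feedforward_config = {
    "name": "MixtureOfExperts",  # was "FusedMLP"; use "MLP" if Triton is not available
    "dropout": 0.1,
    "activation": "gelu",
    "hidden_layer_multiplier": 4,
    "number_of_experts": 4,  # hypothetical: number of experts to instantiate
    "gate": "top_2",         # hypothetical: routing strategy
}
```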
Force-pushed from ca9d016 to da276f4
Codecov Report
@@            Coverage Diff             @@
##             main     #181      +/-   ##
==========================================
- Coverage   90.81%   90.58%    -0.23%
==========================================
  Files          57       58        +1
  Lines        2852     2922       +70
==========================================
+ Hits         2590     2647       +57
- Misses        262      275       +13
@@ -29,8 +29,6 @@ class FusedMlpConfig(FeedforwardConfig):
 class FusedMLP(Feedforward):
     """
     A MLP using fused linear layers.
-
-    .. warning: This is not currently competitive with PyTorch in terms of training speed
not true anymore :D
Looks like this is wrong; the coverage loss is on _csr_ops, which is not changed.
Force-pushed from 2a321ef to f448abe
        self.moe = MOELayer(gate=self.gate, experts=local_experts, group=group)

        self.requires_cuda = True
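For context on the MOELayer line above, here is a rough, hedged sketch of how a FairScale MOELayer is typically assembled: a Top2Gate routes tokens to a ModuleList of local experts. This is not the PR's exact code; the plain nn.Linear experts and the single-process gloo group are stand-ins for illustration.

```python
import os

import torch.distributed as dist
import torch.nn as nn
from fairscale.nn.moe import MOELayer, Top2Gate

# MOELayer queries the process group when it is built, so even a single-process
# sketch needs torch.distributed initialized (gloo is enough on CPU).
if not dist.is_initialized():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=0, world_size=1)

model_dim, num_experts = 256, 4

# Stand-in experts: any feedforward module works; the PR builds proper MLP experts.
local_experts = nn.ModuleList(
    [nn.Linear(model_dim, model_dim) for _ in range(num_experts)]
)
gate = Top2Gate(model_dim, num_experts)  # learned top-2 token-to-expert routing

# `group` defaults to the world group; forward() does an all-to-all inside it,
# which is where the distributed-training complications mentioned above come in.
moe = MOELayer(gate=gate, experts=local_experts)
```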
I'm missing context here, is this used somewhere?
It's an "old" flag: it makes it easy for CI to run or skip a test depending on the hardware it needs, without maintaining an escape list in different places (I think it came from the Triton parts). We can change that; it was just one way to solve this.
LGTM!
Rebased, conflict resolved.
What does this PR do?
Implements Mixture of Experts as a simple Feedforward option, reusing the great MoE implementation from FairScale written by @msbaines back in the day.
This is really for fun and completeness' sake. Example use case: Sparse ViT.
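To make the "simple Feedforward option" concrete, here is a toy, dense (non-distributed, no top-k dispatch) mixture-of-experts feedforward in plain PyTorch. It only illustrates the gating idea and is not the FairScale implementation the PR actually plugs in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMoE(nn.Module):
    """Dense MoE: every expert sees every token, outputs are gate-weighted."""

    def __init__(self, dim: int, num_experts: int = 4, hidden_mult: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(dim, hidden_mult * dim),
                nn.GELU(),
                nn.Linear(hidden_mult * dim, dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) -> per-token expert weights: (batch, seq, experts)
        scores = F.softmax(self.gate(x), dim=-1)
        # Run every expert and weight their outputs by the gate scores.
        outs = torch.stack([expert(x) for expert in self.experts], dim=-1)
        return torch.einsum("bse,bsde->bsd", scores, outs)


y = ToyMoE(dim=64)(torch.randn(2, 16, 64))  # -> (2, 16, 64)
```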
TODO:
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.