The idea is to have multiple instances of each expert, i.e. a pool of M experts that the router dispatches requests to. The number of router instances would then scale with the number of user requests, while the number of expert instances scales at a slower pace (more efficient expert utilization). (But maybe you already implement something similar?)
You could then gather statistics on which experts are used the most and train new models accordingly.
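To make the suggestion concrete, here is a minimal sketch of the pooling idea in Python. All names (`ExpertPool`, `Router`, the hash-based routing policy) are hypothetical illustrations, not part of any existing codebase; a real system would replace the placeholder policy with a learned gating function:

```python
from collections import Counter

class ExpertPool:
    """A pool of replica instances for one expert; tracks per-replica usage."""
    def __init__(self, expert_id, num_instances):
        self.expert_id = expert_id
        self.instances = [f"{expert_id}-replica-{i}" for i in range(num_instances)]
        self.usage = Counter()

    def dispatch(self):
        # Send the request to the least-loaded replica in the pool.
        instance = min(self.instances, key=lambda i: self.usage[i])
        self.usage[instance] += 1
        return instance

class Router:
    """Routes a whole user request to one expert pool (request-level routing)."""
    def __init__(self, pools):
        self.pools = pools
        self.expert_counts = Counter()  # stats: which experts are used the most

    def route(self, request):
        # Placeholder routing policy: hash the request onto an expert.
        # A real router would use a learned gating function instead.
        expert_id = sorted(self.pools)[hash(request) % len(self.pools)]
        self.expert_counts[expert_id] += 1
        return self.pools[expert_id].dispatch()

pools = {f"expert-{e}": ExpertPool(f"expert-{e}", num_instances=2)
         for e in range(4)}
router = Router(pools)
for req in ["what is 2+2?", "translate hello", "write a poem"]:
    print(req, "->", router.route(req))
print("usage stats:", dict(router.expert_counts))
```

The point of the sketch is the scaling split: `Router` objects are cheap and can be replicated per request volume, while each `ExpertPool` only grows when its `expert_counts` share justifies adding replicas. The same counters give the usage statistics mentioned above.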
I also have a question: why is token-level routing used rather than routing the full user request? Is latency the concern there?