We need to develop a Hivemind strategy #27
The more I experiment with LayerShuffle, the less I feel like it could ever possibly work here. No matter what I've tried, LayerShuffle leads to complete model degeneration. We are going to need a different kind of decentralization strategy.

I think that a graph-based approach is one potential option, though it's not clear to me how that would need to work. Another potential option would be a swarm-based/ensemble approach, where many tiny, independent models are asked to work in tandem with one another, towards a common goal. Certainly, this is the approach that most AI organizations are using today, with multi-agent orchestration tooling and Chain-of-Thought prompting: one model generates an output, which is passed to another in "plain text," which is passed to another... many, many times, until a final output is created.

Of course, the main challenge here is speed and compute; routing through a single transformer on desktop compute is already hard, and routing through many of them is even harder. It splits the computation graph across many independent models, and it would require training many independent models simultaneously. Not to mention, with such small models, actually making any of them "behave" correctly would be a very real, potentially impossible task. I don't particularly like either of these options, but it's where we stand.
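For illustration only, here is a minimal sketch of that relay idea: several tiny, independent models passing plain-text outputs to one another until a final answer emerges. The `relay` helper and the use of `distilgpt2` as a stand-in expert are my own placeholders, not anything in this repository.

```python
from transformers import pipeline

def relay(prompt: str, model_names: list[str], max_new_tokens: int = 64) -> str:
    """Chain several small text-generation models; each sees only plain text."""
    text = prompt
    for name in model_names:
        generator = pipeline("text-generation", model=name)
        # the next model receives the previous model's plain-text output
        text = generator(text, max_new_tokens=max_new_tokens)[0]["generated_text"]
    return text

# Hypothetical usage: three tiny models standing in for a swarm of experts.
final = relay(
    "Summarize the tradeoffs of decentralized training:",
    ["distilgpt2", "distilgpt2", "distilgpt2"],
)
print(final)
```

Even this toy version makes the cost obvious: every hop is a full generation pass, so latency grows linearly with the number of models in the chain.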
One thing we should consider is how to integrate peers with differing levels of capability. For example, my machine might be able to run experts with a hidden_size of 256, but yours might be capable of running experts with a hidden_size of 512. In such a scenario, Hivemind communication is not compatible, because Hivemind expects a constant tensor size between peers. The simplest option would be to force a linear projection, where necessary. But there are other options.

There are many approaches. I'm not sure which would be most appropriate at this point.
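As a reference point, here is a minimal sketch of the "force a linear projection" option, assuming peers exchange plain hidden-state tensors of shape (batch, seq_len, hidden). The `HiddenSizeAdapter` class and the 256/512 dimensions are illustrative assumptions, not existing code.

```python
import torch
from torch import nn

class HiddenSizeAdapter(nn.Module):
    """Projects local hidden states to a remote peer's width and back.

    Illustrative only: assumes the remote expert accepts and returns
    tensors of shape (batch, seq_len, remote_dim).
    """

    def __init__(self, local_dim: int, remote_dim: int):
        super().__init__()
        self.to_remote = nn.Linear(local_dim, remote_dim)
        self.from_remote = nn.Linear(remote_dim, local_dim)

    def forward(self, hidden: torch.Tensor, remote_expert: nn.Module) -> torch.Tensor:
        projected = self.to_remote(hidden)      # e.g. 256 -> 512
        remote_out = remote_expert(projected)   # would run on the remote peer
        return self.from_remote(remote_out)     # e.g. 512 -> 256

# Usage: a local peer at hidden_size=256 calling an expert built at 512.
adapter = HiddenSizeAdapter(local_dim=256, remote_dim=512)
dummy_expert = nn.Linear(512, 512)  # stand-in for a remote expert
x = torch.randn(2, 16, 256)
print(adapter(x, dummy_expert).shape)  # torch.Size([2, 16, 256])
```

The obvious catch is that these projections are themselves trainable parameters that live only on the calling peer, so two peers using the same remote expert would each learn their own, divergent adapters.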
I do not have a clear idea about how Hivemind should be integrated, yet. Let this issue exist to document a discussion around potential solutions.
Currently, each layer in a decoder is called an "expert", and we attach each expert to the Hivemind as an independent layer/expert, available for computation. When connected to Hivemind, a peer will automatically crawl the DHT in search of active experts. If one is found (and that peer's cache of usable experts is not full), the new expert will be automatically added to their machine, and it will be available for computation.
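To make the local side of that flow concrete, here is a rough sketch, assuming a discovered remote expert can be treated as a drop-in `nn.Module` that takes and returns hidden states (which is how Hivemind's remote experts behave). The `ExpertPool` class, the `max_remote` cap, and the `maybe_add_remote` method are illustrative names, not the repository's actual implementation.

```python
import torch
from torch import nn

class ExpertPool(nn.Module):
    """Holds local decoder layers plus any remote experts discovered on the DHT."""

    def __init__(self, local_layers: list[nn.Module], max_remote: int = 3):
        super().__init__()
        self.local = nn.ModuleList(local_layers)
        # remote experts are kept in a plain list: their parameters live on other peers
        self.remote: list[nn.Module] = []
        self.max_remote = max_remote

    def maybe_add_remote(self, expert: nn.Module) -> bool:
        # mirrors "if the peer's cache of usable experts is not full, add it"
        if len(self.remote) >= self.max_remote:
            return False
        self.remote.append(expert)
        return True

    def experts(self) -> list[nn.Module]:
        return list(self.local) + list(self.remote)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        for layer in self.experts():
            hidden = layer(hidden)
        return hidden

# Usage with stand-ins for real layers and a discovered remote expert.
pool = ExpertPool([nn.Linear(256, 256) for _ in range(3)])
pool.maybe_add_remote(nn.Linear(256, 256))  # stand-in for a RemoteExpert
print(pool(torch.randn(2, 16, 256)).shape)
```

In the current design, each entry in `self.local` would also be declared on the DHT so that other peers can discover and call it the same way.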
Currently, remote experts will never be used unless the `--shuffle` argument is also being used. A model does not know how to route through a remote peer without the random permutations introduced by shuffling during training. The exact method for permutation may need to change (I am not having much luck with random, naive shuffling right now).

Now, there are clear problems with this design:
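For context, this is roughly what the naive, random shuffling looks like when remote experts are simply included in the pool being permuted. The `ShuffledDecoder` class below is an illustration of the idea, not the repository's shuffling code.

```python
import random
import torch
from torch import nn

class ShuffledDecoder(nn.Module):
    """Applies a random permutation of experts on every training forward pass."""

    def __init__(self, experts: list[nn.Module]):
        super().__init__()
        self.experts = nn.ModuleList(experts)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        order = list(range(len(self.experts)))
        if self.training:
            random.shuffle(order)  # naive, LayerShuffle-style permutation
        for i in order:
            hidden = self.experts[i](hidden)
        return hidden

# Usage: local layers (and, in principle, remote experts) shuffled per step.
decoder = ShuffledDecoder([nn.Linear(256, 256) for _ in range(4)])
print(decoder(torch.randn(2, 16, 256)).shape)
```

Because remote experts would sit in the same list, a shuffled model at least has a chance of tolerating an arbitrary expert appearing at an arbitrary depth, which is why routing through remote peers currently depends on `--shuffle`.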
Anyway, those are my thoughts. More to come.