We need to develop a Hivemind strategy #27

Open
Vectorrent opened this issue Nov 29, 2024 · 2 comments
Labels
help wanted (Extra attention is needed)

Comments

@Vectorrent
Contributor

I do not have a clear idea of how Hivemind should be integrated yet. Let this issue document a discussion around potential solutions.

Currently, each layer in a decoder is called an "expert," and we attach each expert to Hivemind as an independent layer/expert, available for computation. When connected to Hivemind, a peer will automatically crawl the DHT in search of active experts. If one is found (and the peer's cache of usable experts is not full), the new expert is automatically added to that peer's machine, where it becomes available for computation.
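
For illustration, here is a toy sketch of that crawl-and-cache flow, with a plain dict standing in for the DHT; `ExpertCache` and `crawl` are made-up names, not Hivemind's actual API:

```python
# Toy sketch of the crawl-and-cache behavior described above. A plain dict
# stands in for the Hivemind DHT; none of these names are Hivemind's real API.

dht = {  # expert uid -> peer address (mocked)
    "expert.0": "peer-A",
    "expert.1": "peer-B",
    "expert.2": "peer-C",
}

class ExpertCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.experts: dict[str, str] = {}

    def crawl(self, dht: dict[str, str]) -> None:
        # Adopt newly discovered experts until the local cache is full.
        for uid, peer in dht.items():
            if len(self.experts) >= self.capacity:
                break
            self.experts.setdefault(uid, peer)

cache = ExpertCache(capacity=2)
cache.crawl(dht)
print(cache.experts)  # {'expert.0': 'peer-A', 'expert.1': 'peer-B'}
```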

Currently, remote experts will never be used unless the --shuffle argument is also passed. A model does not know how to route through a remote peer without the random permutations introduced by shuffling during training. The exact permutation method may need to change; I am not having much luck with naive, random shuffling right now.
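
For concreteness, here is a minimal PyTorch sketch of what I mean by shuffling. This is my reading of the idea, not the project's exact `--shuffle` implementation, and `ShuffledDecoder` is a made-up stand-in (with a generic transformer block in place of our expert layers):

```python
import torch
import torch.nn as nn

# Minimal sketch: decoder layers are applied in a random order on every
# training step, so each "expert" must cope with any predecessor's output.

class ShuffledDecoder(nn.Module):
    def __init__(self, num_layers: int = 4, hidden_size: int = 256):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model=hidden_size, nhead=4, batch_first=True
            )
            for _ in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            order = torch.randperm(len(self.layers)).tolist()
        else:
            order = range(len(self.layers))
        for i in order:
            x = self.layers[i](x)
        return x

model = ShuffledDecoder()
hidden = model(torch.randn(2, 16, 256))  # (batch, seq_len, hidden_size)
```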

Now, there are clear problems with this design:

  • Local and remote optimizers are not connected, nor could we really expect them to be. If we were to add many remote experts to our local optimizer, VRAM consumption would grow dramatically, speeds would decrease, and we would be giving remote/untrusted peers A LOT of influence over the state of our model (which may not be desirable).
  • So, without connected optimizers, we currently treat remote experts as frozen, static layers without any trainable parameters. Freezing modules is a well-known technique in continual learning (freezing some expert modules in an MoE, for example).
  • However, our situation is somewhat worse than that, because we do not even "know" what parameters the remote expert has. Unlike the well-known continual-learning approaches, the only things we know are the input and output shapes of the remote expert. Thus, integrating the remote expert in a meaningful way becomes a challenge.
  • Currently, we just send our inputs to the remote peer, receive their response, and "integrate" it via a residual connection (which restores the gradient path around the non-differentiable expert's outputs; see the sketch after this list). This is a painfully simple strategy, and it almost certainly won't be sufficient.
  • Alternative strategies include differentiable sampling (i.e. a Gumbel-Softmax), residual gating (a less naive form of expert influence), or - my favorite option - downloading an offline/local copy of the remote peer's parameters and training some local, differentiable parameters to respect those remote weights. I'm currently thinking that an exponential moving average over an approximation of the remote weights, combined with some kind of gating strategy, may allow our local model to learn the inductive biases of the remote expert in a differentiable yet cheap fashion. I could be completely mistaken.
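
Here is a minimal PyTorch sketch of the residual integration, plus the gated variant. The remote expert is mocked as a frozen local module whose output carries no gradient, mimicking a non-differentiable remote call; `gate` is an illustrative parameter, not existing project code:

```python
import torch
import torch.nn as nn

hidden_size = 256
remote_expert = nn.Linear(hidden_size, hidden_size)  # stand-in for a peer
remote_expert.requires_grad_(False)

# sigmoid(-4) ~= 0.02, so the remote expert's influence starts small and the
# gate can learn to increase it.
gate = nn.Parameter(torch.full((1,), -4.0))

def integrate(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():  # the remote call is opaque to autograd
        remote_out = remote_expert(x)
    # Plain residual would be `x + remote_out`: gradients flow through the
    # identity path only. The gated variant also learns how much remote
    # output to mix in.
    return x + torch.sigmoid(gate) * remote_out

x = torch.randn(2, 16, hidden_size, requires_grad=True)
y = integrate(x)
y.sum().backward()  # x.grad and gate.grad exist; the remote expert gets none
```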

Anyway, those are my thoughts. More to come.

@Vectorrent added the help wanted label Nov 29, 2024
@Vectorrent
Contributor Author

Vectorrent commented Dec 1, 2024

The more I experiment with LayerShuffle, the less I believe it could ever work here. No matter what I've tried, LayerShuffle leads to complete model degeneration, like this:
[screenshot: sample output showing complete model degeneration]
And really, how could this ever work? A fundamental principle of the transformer architecture is that, through each sequential layer, the model learns to transform and "compose" intermediate representations of the data, with each layer building upon the previous ones. When you naively shuffle those layers, you create an extreme form of regularization, in which every layer must learn to transform the hidden states of every other layer, in any possible order. Even if this could be made to work at a smaller scale, adding more layers would almost certainly exacerbate the problem. Sequential models simply do not suffer from this kind of degeneration.

We are going to need a different kind of decentralization strategy.

I think that a graph-based approach is one potential option, though it's not clear to me how that would need to work.

Another potential option would be a swarm-based/ensemble approach, where many tiny, independent models are asked to work in tandem toward a common goal. Certainly, this is the approach most AI organizations are using today, with multi-agent orchestration tooling and Chain of Thought prompting. One model generates an output, which is passed to another in "plain text," which is passed to another, and so on, many times over, until a final output is created (a toy sketch follows below). Of course, the main challenge here is speed and compute: routing through a single transformer on desktop compute is already hard, and routing through many of them is even harder. It splits the computation graph across many independent models, and it would require training many independent models simultaneously. Not to mention that, with such small models, actually making any of them "behave" correctly would be a very real, potentially impossible task.
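
As a toy illustration of that relay, with trivial string functions standing in for the tiny models (every name here is hypothetical):

```python
# Toy sketch of the plain-text relay described above. The `tiny_model_*`
# functions are trivial stand-ins for independent small models.

def tiny_model_draft(prompt: str) -> str:
    return f"draft({prompt})"

def tiny_model_refine(text: str) -> str:
    return f"refine({text})"

def tiny_model_finalize(text: str) -> str:
    return f"final({text})"

pipeline = [tiny_model_draft, tiny_model_refine, tiny_model_finalize]

text = "What is decentralized training?"
for step in pipeline:
    # Every hop is a full generation pass on some peer's hardware,
    # which is where the speed/compute cost comes from.
    text = step(text)
print(text)  # final(refine(draft(What is decentralized training?)))
```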

I don't particularly like either of these options, but it's where we stand.

@Vectorrent
Contributor Author

One thing we should consider is how to integrate peers with differing levels of capability. For example, my machine might be able to run experts with a hidden_size of 256, while yours might be capable of running experts with a hidden_size of 512. In such a scenario, the two peers cannot communicate through Hivemind, because Hivemind expects a constant tensor size between peers.

The simplest option would be to force a linear projection where necessary. But there are other options (sketched in code below):

  • We could "slice" tensors before passing them through the swarm.
  • We could use linear interpolation to upsample or downsample tensors.
  • We could use a top-k approach to select the "most similar" values to pass.
  • Etc.

There are many approaches; I'm not sure which would be most appropriate at this point.
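
For reference, here is a quick PyTorch sketch of each option, bridging a hypothetical local hidden_size of 256 to a remote expert that expects 512; none of this is existing project code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

local_dim, remote_dim = 256, 512
batch, seq = 2, 16
x = torch.randn(batch, seq, local_dim)

# 1. Forced linear projection (learnable; the "simplest option" above).
up = nn.Linear(local_dim, remote_dim)
x_proj = up(x)  # (2, 16, 512)

# 2. Slicing (only works in the shrinking direction).
x_sliced = x_proj[..., :local_dim]  # (2, 16, 256)

# 3. Linear interpolation along the feature dimension (up- or downsampling).
x_interp = F.interpolate(
    x.reshape(batch * seq, 1, local_dim), size=remote_dim, mode="linear"
).reshape(batch, seq, remote_dim)  # (2, 16, 512)

# 4. Top-k: one reading of "most similar" is keeping the k largest-magnitude
#    features (shrinking only; it discards which positions the values held).
idx = x_proj.abs().topk(k=local_dim, dim=-1).indices
x_topk = torch.gather(x_proj, dim=-1, index=idx)  # (2, 16, 256)
```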
