Skip to content

0-5788719150923125/praxis

Repository files navigation

praxis

Praxis is the process by which a theory, lesson, or skill is enacted, embodied, realized, applied, or put into practice.

Terminal

what we're building

The Praxis architecture is a fluid, peer-to-peer, always online, continuously-learning and decentralized place to practice computational alchemy. With Hivemind integrated directly into core infrastructure of our ecosystem, the goal is to build a multi-modal language model that is small and simple, easy to parallelize, fault-tolerant, and performant at a scale of hundreds/thousands of self-hosted peers. We will achieve this via a sparse mixture of experts, user-curated multipath routing, symbolic decision-making and weighted self-modeling of network components.

features

  • A Mixture of Depths allows us to route just a subset of all tokens in a sequence through a layer, and to remote peers - reducing the time required for remote computation. All other tokens bypass the layer via a residual connection.
  • LayerShuffle proved that transformers can maintain coherence, even when every layer is shuffled at every forward pass. We take this a step further, and implement the PraxisController, which teaches the model how to predict an optimal route through expert layers during inference. The ability to work with out-of-order layers is crucial in a decentralized architecture, where some peers may fail, others may disappear, some may be overloaded, or undertrained, or are otherwise penalized for some reason or another...
  • As an alternative to LayerShuffle's controller, we have an experiment that implements elements from Graphformer, teaching the model to route through layers as if they were nodes in a graph.
  • In addition to the shuffling, we implement a simplified version of CALM, which allows the model to early-exit from computation.
  • We implement RoPE, ALiBi and NoPE as options for positional encoding, because they're easy, work well at sane contexts lengths, and require little to no trainable parameters.
  • Differential Attention is used to improve hallucination performance, reduce parameter counts required for attention, and filter-out noise in attention maps. Alternatively (and perhaps in-addition to, in the future), we implement an option for Stickbreaking Attention, which naturally-encodes positional information, uses a Sigmoid-based mechanism, instead of a Softmax (i.e. parameters "work together", instead of "competing" against each other). We also implement various methods from MEGA, including the Exponential Moving Average-based attention gating, and Gated Single-Head Attention modules.
  • Parameter-Efficient Expert Retrieval (PEER) from the Mixture of a Million Experts paper. Here, feedforward layers are replaced with a swarm of singleton MLP networks.
  • While simple, a Soft-Merging of Experts with Adaptive Routing class allows us to dynamically-route through a dense feedforward layer, while maintaining differentiability and enhancing expressivity.
  • We support Infini-Attention, from Leave No Context Behind, to reduce the O(n^2) memory complexity of transformer attention to O(n). This is the same technique that Google uses in Gemini.
  • We have a Kolmogorov-Arnold Networks experiment, which replaces MLPs with KANs.
  • We implement an optional Byte Latent Tokenizer, which allows us to represent tokens as patches of byte-sequences, instead of discrete tokens. This way, we can remove the tokenizer - and represent data in much more interesting ways, within the latent space.
  • We support Hyper-Connections, which are an alternative to residual connections.
  • There's also a mobile app, and a remote controller, called "Axis". We used Godot for that.

join us

install

Setup a virtual environment:

source make-venv.sh

Alternatively, you may use the VSCode command bar (Ctrl + Shift + P), and choose: Python: Create Environment...

Then, install dependencies:

# Install training dependencies
pip install -e .[all]

contribute to the swarm

To donate your compute:

python run.py

To view all supported command-line arguments:

python run.py --help

recommendations

We recommend you use a batch_size of at least 16, if possible:

python run.py --batch_size 16

The reason for this is that we have implemented an oversampling mechanism, which can expose your model to longer sequences during training (improving generalization and maximum supported sequence length). This oversampling mechanism periodically doubles the sequence length, and scales quadratically at batch sizes of 1, 4, and 16.

do inference

Send a JSON-encoded payload via POST to:

http://localhost:2100/input

This payload should support all arguments in the Transformers text generation API.

Example request:

import requests

url = "http://localhost:5000/input"
payload = {"prompt": "Once upon a time, ", "do_sample": True, "temperature": 0.7}

response = requests.post(url, json=payload)

print(response.status_code)
print(response.json())

local web chat (coming soon!)

Chat and swarm management interface is available here:

http://localhost:2100

mobile app

We're building a mobile app, to control your experts! You can see that code in the ./axis directory.

to register with transformers

from transformers import AutoConfig, AutoModel, AutoModelForCausalLM, AutoTokenizer
from praxis import PraxisConfig, PraxisForCausalLM, PraxisModel

AutoConfig.register("praxis", PraxisConfig)
AutoModel.register(PraxisConfig, PraxisModel)
AutoModelForCausalLM.register(PraxisConfig, PraxisForCausalLM)

config = PraxisConfig(
    embed_size=512,
    hidden_size=384,
    depth=6,
    num_heads=8,
    device_map="cuda:0",
)

tokenizer_model = "UNSAFE/praxis-4096"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model)

model = AutoModelForCausalLM.from_config(config)

input_ids = tokenizer.encode("The quick brown fox ")

outputs = model.generate(input_ids, do_sample=True)

print(self.tokenizer.decode(outputs[0], skip_special_tokens=True))
# --> The quick brown fox jumped over a lazy dog.

goals

  • a global swarm
  • self-modeling makes peers less-complex, and easier to model (for other AI)
  • layers as experts; a marketplace of expert, composable "blocks"
  • commit to yourself
  • cascade-style token routing (ping -> pang -> pong -> ping) via a Mixture of Depths; cyclical graph computation
  • treat every peer as an experiment in hyperparameter search; publish results to the DHT, and ensure that well-performing hparams are assigned more often
  • build adapters/connectors, allowing people to integrate their nodes with external data sources

notes, ideas and random things I want to remember

won't do

  • cryptocurrency (donations are appreciated, though!)