ByteLLM is a project that aims to close the remaining gap in end-to-end training of Large Language Models (LLMs) by replacing subword tokenization with a byte-level tokenizer, ByteTokenizer. While the significance of this approach may not be immediately apparent, it opens up intriguing possibilities for how LLMs are trained.
The vision of ByteLLM is to demonstrate that byte-level tokenization can be integrated directly into existing LLMs, offering an alternative approach to model training and usage.
Byte-level tokenization offers a distinct perspective in natural language processing (NLP): by operating on raw bytes instead of characters or subwords, ByteTokenizer removes the need for a learned vocabulary, keeps the vocabulary small (256 byte values plus any special tokens), and simplifies the training pipeline of LLMs.
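To make the idea concrete, here is a minimal sketch of what a byte-level tokenizer can look like. The class name, the special-token IDs, and the method signatures are illustrative assumptions, not the actual src/utils/byte_tokenizer.py implementation.

```python
# Minimal byte-level tokenizer sketch (illustrative; not the project's actual implementation).
from typing import List


class ByteTokenizer:
    """Maps text to byte IDs (0-255) plus a few reserved special-token IDs."""

    def __init__(self):
        # Assumption: special tokens are appended after the 256 raw byte values.
        self.pad_id = 256
        self.bos_id = 257
        self.eos_id = 258
        self.vocab_size = 259

    def encode(self, text: str, add_special_tokens: bool = True) -> List[int]:
        ids = list(text.encode("utf-8"))  # each UTF-8 byte becomes one token ID
        if add_special_tokens:
            ids = [self.bos_id] + ids + [self.eos_id]
        return ids

    def decode(self, ids: List[int]) -> str:
        # Drop special-token IDs, then reassemble the raw bytes back into text.
        raw = bytes(i for i in ids if i < 256)
        return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    tok = ByteTokenizer()
    ids = tok.encode("héllo")   # non-ASCII characters expand to multiple bytes
    print(ids)                  # e.g. [257, 104, 195, 169, 108, 108, 111, 258]
    print(tok.decode(ids))      # "héllo"
```

Because the vocabulary is fixed at 256 byte values, there is no tokenizer training step and no out-of-vocabulary tokens.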
Related Work
- MegaByte
  - Paper: MegaByte
  - Implementation: MEGABYTE-pytorch
- MambaByte
  - Paper: MambaByte
Features
- Byte-Level Tokenization: Simplifies the tokenization process by focusing on bytes.
- End-to-End Training: Enables seamless end-to-end training of LLMs.
- Compatibility: Can be integrated with existing LLM frameworks (see the sketch below).
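As an illustration of the compatibility point, the sketch below wires byte-level inputs into a Hugging Face GPT-2 model by shrinking the vocabulary to 256 byte values plus special tokens. The use of the transformers library and the specific model dimensions are assumptions for demonstration; they are not taken from the project's configs.

```python
# Hedged sketch: plugging byte-level tokenization into an existing LLM framework.
# The transformers dependency and the model dimensions are assumptions, not project settings.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# A byte vocabulary: 256 raw byte values + 3 assumed special tokens (pad/bos/eos).
VOCAB_SIZE = 259

config = GPT2Config(
    vocab_size=VOCAB_SIZE,   # tiny vocabulary compared to subword tokenizers
    n_positions=1024,
    n_embd=256,
    n_layer=4,
    n_head=4,
)
model = GPT2LMHeadModel(config)

# Encode a piece of text as UTF-8 bytes (no learned tokenizer required).
text = "Byte-level tokenization keeps the vocabulary small."
input_ids = torch.tensor([list(text.encode("utf-8"))])

# Standard causal-LM training step: the model predicts the next byte.
outputs = model(input_ids=input_ids, labels=input_ids)
outputs.loss.backward()
print(f"loss: {outputs.loss.item():.3f}")
```

The trade-off is that byte sequences are several times longer than subword sequences for the same text, which is exactly the problem that architectures such as MegaByte and MambaByte are designed to address.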
Train
python src/train.py --train_config configs/train.yaml --model_config configs/model_configs/gpt2_small.yaml --use_byte_tokenizer
Resume training from a checkpoint
python src/train.py --train_config configs/train.yaml --model_config configs/model_configs/gpt2_small.yaml --use_byte_tokenizer --resume_from_checkpoint /path/to/your/checkpoint
Test
python src/test.py --test_config configs/test.yaml --model_config configs/model_configs/gpt2_small.yaml --use_byte_tokenizer
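The flags above map naturally onto a small argparse entry point. The sketch below is an assumed outline of how src/train.py could parse these arguments and load the YAML configs; it is not the project's actual code, and the commented-out helpers are hypothetical.

```python
# Assumed outline of the training entry point; the commented-out helpers are hypothetical.
import argparse
import yaml


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Train a ByteLLM model.")
    parser.add_argument("--train_config", required=True, help="Path to the training YAML config.")
    parser.add_argument("--model_config", required=True, help="Path to the model YAML config.")
    parser.add_argument("--use_byte_tokenizer", action="store_true",
                        help="Use ByteTokenizer instead of a subword tokenizer.")
    parser.add_argument("--resume_from_checkpoint", default=None,
                        help="Optional checkpoint path to resume training from.")
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    with open(args.train_config) as f:
        train_cfg = yaml.safe_load(f)
    with open(args.model_config) as f:
        model_cfg = yaml.safe_load(f)

    # Hypothetical wiring: select the tokenizer based on the CLI flag.
    # tokenizer = ByteTokenizer() if args.use_byte_tokenizer else load_subword_tokenizer(train_cfg)
    # model = build_model(model_cfg, vocab_size=tokenizer.vocab_size)
    # if args.resume_from_checkpoint:
    #     model.load_state_dict(torch.load(args.resume_from_checkpoint))
    print(train_cfg, model_cfg)


if __name__ == "__main__":
    main()
```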
ByteLLM/
│
├── configs/
│   ├── model_configs/
│   │   ├── model_a.yaml
│   │   └── model_b.yaml
│   ├── train.yaml
│   └── test.yaml
│
├── src/
│   ├── models/
│   │   ├── __init__.py
│   │   └── custom_models.py
│   ├── utils/
│   │   ├── __init__.py
│   │   └── byte_tokenizer.py
│   ├── train.py
│   └── test.py
│
├── requirements.txt
└── README.md