ByteLLM

ByteLLM is a project that aims to close the remaining gap in end-to-end training of Large Language Models (LLMs) by using a byte-level tokenizer, ByteTokenizer. A conventional subword tokenizer is built in a separate step before the model is trained; working directly on bytes removes that step, so the model learns from the raw input.

Vision

The vision of ByteLLM is to demonstrate that byte-level tokenization can be integrated directly into existing LLMs, so that they can be trained and used on raw bytes.

Background

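Byte-level tokenization represents text as the sequence of bytes in its encoding (typically UTF-8) instead of as characters or subwords. Because a byte can take only 256 values, the vocabulary is small and fixed, and no tokenizer has to be trained. ByteTokenizer aims to use this to simplify the training process of LLMs. The trade-off is that byte sequences are longer than the corresponding subword sequences.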
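To make the idea concrete, here is a minimal sketch of a byte-level tokenizer. It is not the project's ByteTokenizer: the class name SimpleByteTokenizer, the use of UTF-8, and the absence of special tokens are all assumptions made for illustration.

```python
from typing import List


class SimpleByteTokenizer:
    """Hypothetical byte-level tokenizer (not the project's ByteTokenizer)."""

    # Assumption: one token ID per possible byte value, no special tokens.
    vocab_size = 256

    def encode(self, text: str) -> List[int]:
        # UTF-8 turns the string into bytes; each byte value (0-255) is a token ID.
        return list(text.encode("utf-8"))

    def decode(self, ids: List[int]) -> str:
        # errors="replace" keeps decoding from raising if the IDs are not valid UTF-8.
        return bytes(ids).decode("utf-8", errors="replace")


tokenizer = SimpleByteTokenizer()
ids = tokenizer.encode("h\u00e9llo")  # "é" takes two bytes in UTF-8
print(ids)                    # [104, 195, 169, 108, 108, 111]
print(tokenizer.decode(ids))  # héllo
```

The five-character string becomes six token IDs, which shows both the appeal (no vocabulary to learn) and the cost (longer sequences for non-ASCII text).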

Related Work

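MegaByte
- Paper: MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
- Implementation: MEGABYTE-pytorch
MambaByte
- Paper: MambaByte: Token-free Selective State Space Model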

Features

- Byte-Level Tokenization: Works directly on bytes, so no vocabulary has to be learned.
- End-to-End Training: Trains the LLM on raw byte sequences, without a separate tokenizer-training step.
- Compatibility: Can be integrated with existing LLM frameworks.
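As a rough illustration of why byte IDs fit a standard language-model training loop, the sketch below turns raw text into next-token prediction pairs. The helper name make_training_pairs and the block_size parameter are hypothetical and are not taken from the project's code; the point is only that byte IDs can be fed to a model like any other token IDs, with a vocabulary size of 256.

```python
from typing import List, Tuple


def make_training_pairs(text: str, block_size: int) -> List[Tuple[List[int], List[int]]]:
    """Split text into (input, target) byte-ID pairs for next-token prediction."""
    ids = list(text.encode("utf-8"))
    pairs = []
    # Each window holds block_size + 1 IDs: the inputs plus one extra ID
    # so that the targets can be the inputs shifted left by one position.
    for start in range(0, len(ids) - block_size, block_size):
        chunk = ids[start:start + block_size + 1]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs


pairs = make_training_pairs("byte-level", block_size=4)
for inputs, targets in pairs:
    print(bytes(inputs), "->", bytes(targets))
# b'byte' -> b'yte-'
# b'-lev' -> b'leve'
```

Trailing bytes that do not fill a complete window are dropped in this sketch; a real data pipeline might pad or carry them over instead.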

How to run

Train

python src/train.py --train_config configs/train.yaml --model_config configs/model_configs/gpt2_small.yaml --use_byte_tokenizer

Resume training from a checkpoint

python src/train.py --train_config configs/train.yaml --model_config configs/model_configs/gpt2_small.yaml --use_byte_tokenizer --resume_from_checkpoint /path/to/your/checkpoint

Test

python src/test.py --test_config configs/test.yaml --model_config configs/model_configs/gpt2_small.yaml --use_byte_tokenizer
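How src/train.py reads these flags is not shown here. As a sketch only, a parser that accepts the same flags as the training commands above could be written with argparse as follows. The help texts, the choice to make the two config paths required, and the treatment of --use_byte_tokenizer as a boolean switch are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the training commands."""
    parser = argparse.ArgumentParser(description="Sketch of a training CLI")
    parser.add_argument("--train_config", required=True,
                        help="path to the training YAML file")
    parser.add_argument("--model_config", required=True,
                        help="path to the model YAML file")
    # Assumption: the flag is a switch that is off unless given.
    parser.add_argument("--use_byte_tokenizer", action="store_true",
                        help="tokenize input at the byte level")
    # Assumption: omitting the flag means training starts from scratch.
    parser.add_argument("--resume_from_checkpoint", default=None,
                        help="checkpoint path to resume from")
    return parser


# Parse the same arguments as the plain "Train" command.
args = build_parser().parse_args([
    "--train_config", "configs/train.yaml",
    "--model_config", "configs/model_configs/gpt2_small.yaml",
    "--use_byte_tokenizer",
])
print(args.use_byte_tokenizer)      # True
print(args.resume_from_checkpoint)  # None
```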

Project

ByteLLM/
│
├── configs/
│   ├── model_configs/
│   │   ├── model_a.yaml
│   │   └── model_b.yaml
│   ├── train.yaml
│   └── test.yaml
│
├── src/
│   ├── models/
│   │   ├── __init__.py
│   │   └── custom_models.py
│   ├── utils/
│   │   ├── __init__.py
│   │   └── byte_tokenizer.py
│   ├── train.py
│   └── test.py
│
├── requirements.txt
└── README.md

relic-yuexi/ByteLLM
