Pretrain TinyLlama

Installation

We assume that you have CUDA 11.8 installed.

Install PyTorch Nightly.

pip install --index-url https://download.pytorch.org/whl/nightly/cu118 --pre 'torch>=2.1.0dev'
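
To confirm that the nightly wheel was built against CUDA 11.8, a quick check such as the one below helps; the exact version string will vary with the nightly date.

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# should print a 2.1.0 dev build, 11.8, and True on a machine with a visible GPU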

Build XFormers from Source

Note: as of 2023/09/02, xformers does not provide pre-built binaries for PyTorch 2.1, so you have to build it from source.

pip uninstall ninja -y && pip install ninja -U
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
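
Once the build completes, xformers ships a small diagnostic entry point that reports its version and which kernels were compiled, which is a convenient way to confirm the source build succeeded:

python -m xformers.info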

Install Flash-Attention 2 and other fused operators:

git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
python setup.py install
cd csrc/rotary && pip install .
cd ../layer_norm && pip install .
cd ../xentropy && pip install .
cd ../../.. && rm -rf flash-attention
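
A quick import check is usually enough to confirm that Flash-Attention 2 built correctly against your PyTorch install (the fused kernels from csrc/ install separate extension modules that are not checked here):

python -c "import flash_attn; print(flash_attn.__version__)"
# should print a 2.x version without an ImportError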

Install Remaining Dependencies

pip install -r requirements.txt tokenizers sentencepiece

to install the remaining dependencies. Building xformers/flash-attention may take 5 minutes or more, so do not worry if the process seems stagnant or the terminal prints many warnings.

Then you are ready to go 🎉!

Data Preparation

Download Datasets

Download the SlimPajama and Starcoderdata datasets to your chosen directory.

cd /path/to/dataset
git lfs install
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B
git clone https://huggingface.co/datasets/bigcode/starcoderdata

The SlimPajama dataset takes 893GB of disk space and Starcoderdata takes 290GB.
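
Together the two clones need roughly 1.2TB, so it is worth checking free space on the target filesystem before starting and the size of each clone afterwards (the paths below are the same placeholders as above):

df -h /path/to/dataset
du -sh /path/to/dataset/SlimPajama-627B /path/to/dataset/starcoderdata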

Tokenize data

Use the provided scripts to tokenize the datasets and divide them into chunks.

python scripts/prepare_starcoder.py --source_path /path/to/starcoderdata/ --tokenizer_path data/llama --destination_path data/slim_star_combined --split train --percentage 1.0
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path data/llama  --destination_path data/slim_star_combined --split validation --percentage 1.0
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path data/llama  --destination_path data/slim_star_combined --split train --percentage 1.0

The processed data will take roughly 1.8TB of storage.
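
Once all three scripts finish, a quick look at the destination directory confirms the chunked output is in place; the exact chunk file names are produced by the prepare scripts and may differ between versions:

# total size should be on the order of 1.8TB
du -sh data/slim_star_combined
# list a few of the packed chunk files
ls data/slim_star_combined | head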

Pretraining

If your setup comprises two nodes, each with 8 GPUs, you can initiate pretraining with the following commands:

On node 1:

lightning run model \
    --node-rank=0  \
    --main-address=172.16.101.5 \
    --accelerator=cuda \
    --devices=8 \
    --num-nodes=2 \
    pretrain/tinyllama.py --devices 8 --train_data_dir data/slim_star_combined --val_data_dir data/slim_star_combined

On node 2:

lightning run model \
    --node-rank=1  \
    --main-address=172.16.101.5 \
    --accelerator=cuda \
    --devices=8 \
    --num-nodes=2 \
    pretrain/tinyllama.py --devices 8 --train_data_dir data/slim_star_combined --val_data_dir data/slim_star_combined

If you have a Slurm cluster, you can follow Lightning's multi-node launch instructions instead; a minimal sbatch sketch is shown below.
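
The script below only sketches what such a job could look like: it reuses the lightning run model command from above, with the node rank and main address filled in from Slurm's environment. The job name, time limit, and environment activation are placeholders, and Lightning Fabric's own Slurm integration may be preferable, so check the Lightning documentation before relying on it.

#!/bin/bash
# placeholder job name and time limit; adjust for your cluster
#SBATCH --job-name=tinyllama-pretrain
#SBATCH --time=72:00:00
# two nodes with 8 GPUs each, one launcher process per node
# (lightning run model spawns the 8 per-GPU worker processes itself)
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8

# placeholder: activate whichever environment holds torch/xformers/flash-attention
source activate tinyllama

# the first node in the allocation serves as the rendezvous address
export MAIN_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun bash -c 'lightning run model \
    --node-rank=$SLURM_NODEID \
    --main-address=$MAIN_ADDR \
    --accelerator=cuda \
    --devices=8 \
    --num-nodes=2 \
    pretrain/tinyllama.py --devices 8 --train_data_dir data/slim_star_combined --val_data_dir data/slim_star_combined'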