Pecca starts out as a Rust port of @karpathy's excellent llama2.c, itself a minimalistic adaptation of llama.cpp.
Compared to other Rust ports, Pecca leverages ndarray, which has several advantages:
- Type Safety: all matrices have proper dimensions (instead of giant flat arrays) and most operations check that dimensions are compatible.
- Speed: out of the box and single-threaded, Pecca is already slightly faster than the C version.
- Readability: matrix operations can be written succinctly. This first version of pecca-rs is only 425 lines, including comments.
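To make the type-safety point concrete, here is a minimal std-only sketch of a matrix that carries its own shape and rejects a mismatched matrix-vector product. The `Matrix` type and its `matvec` method are hypothetical names used for illustration and are not taken from Pecca; ndarray's own `dot` performs a comparable shape check and panics on incompatible shapes rather than returning an error.

```rust
/// Hypothetical row-major matrix that carries its own shape (not Pecca's type).
#[derive(Debug)]
struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f32>,
}

impl Matrix {
    /// Builds a matrix, rejecting empty shapes and buffers of the wrong length.
    fn new(rows: usize, cols: usize, data: Vec<f32>) -> Result<Self, String> {
        if rows == 0 || cols == 0 {
            return Err("rows and cols must be non-zero".to_string());
        }
        if data.len() != rows * cols {
            return Err(format!("expected {} values, got {}", rows * cols, data.len()));
        }
        Ok(Matrix { rows, cols, data })
    }

    /// Matrix-vector product that checks the shared dimension before computing.
    fn matvec(&self, x: &[f32]) -> Result<Vec<f32>, String> {
        if x.len() != self.cols {
            return Err(format!(
                "shape mismatch: {}x{} matrix, vector of length {}",
                self.rows,
                self.cols,
                x.len()
            ));
        }
        // Each chunk of `cols` values is one row; take its dot product with x.
        Ok(self
            .data
            .chunks(self.cols)
            .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
            .collect())
    }
}

fn main() {
    let w = Matrix::new(2, 3, vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0]).unwrap();
    // Compatible shapes: a 2x3 matrix times a vector of length 3.
    println!("{:?}", w.matvec(&[1.0, 0.0, 1.0]));
    // Incompatible shapes are reported instead of silently reading the wrong values.
    println!("{:?}", w.matvec(&[1.0, 0.0]));
}
```

With a flat array and hand-written index arithmetic, the second call would typically compute a wrong result or read out of bounds; with shaped matrices the mistake surfaces at the call site.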
Going forward, Pecca will leverage Rust and its ecosystem whenever it makes sense, rather than avoiding dependencies above all else (as llama.cpp does).
git clone https://github.com/rahoua/pecca-rs.git
cd pecca-rs
wget -P ./models/stories/ https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
cargo run --release generate ./models/stories/stories15M.bin
Pecca can be run similarly with larger tiny stories models (like the 110M one) or the llama2 models (only 7B recommended so far). For a full list of command line options run:
pecca-rs --help
To get the llama2 models, follow the instructions for llama2.c. Pecca supports the same model format. As Pecca does not use memmap, loading and quantizing the model on the fly can take some time. To speed things up, the models can also be saved quantized using the -f --write-model <path> command line switch.
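For readers unfamiliar with what quantizing weights to i8 involves, the sketch below shows one common approach: symmetric absmax quantization with a single scale per tensor. This is an assumption made for illustration; the text above does not describe Pecca's actual scheme, which may differ (for example by using per-group scales). The names `QuantizedI8`, `quantize_i8` and `dequantize` are hypothetical.

```rust
/// Quantized tensor: i8 values plus the scale needed to recover approximate f32 values.
/// Hypothetical type for illustration, not Pecca's own representation.
struct QuantizedI8 {
    scale: f32,
    values: Vec<i8>,
}

/// Symmetric absmax quantization: maps [-max|w|, max|w|] onto [-127, 127].
fn quantize_i8(weights: &[f32]) -> QuantizedI8 {
    let max_abs = weights.iter().fold(0.0_f32, |m, w| m.max(w.abs()));
    // Avoid dividing by zero when the tensor is all zeros.
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let values = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    QuantizedI8 { scale, values }
}

/// Recovers approximate f32 values; the rounding error is at most about scale / 2 per value.
fn dequantize(q: &QuantizedI8) -> Vec<f32> {
    q.values.iter().map(|&v| v as f32 * q.scale).collect()
}

fn main() {
    let weights = [0.5_f32, -1.27, 0.0, 1.0];
    let q = quantize_i8(&weights);
    let restored = dequantize(&q);
    // Print each original weight next to its value after a quantization round trip.
    for (w, r) in weights.iter().zip(&restored) {
        println!("{w:+.4} -> {r:+.4}");
    }
}
```

Each weight shrinks from four bytes to one, which is why saving the already-quantized model avoids repeating this pass over every tensor at load time.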
For codellama, the instructions are similar except for the tokenizer, which is slightly different. To make the process easier, the updated tokenizer is provided. To override the default tokenizer, run pecca using the -k command line option:
./target/release/pecca-rs generate /path/to/codellama-instr-7b.bin -k "./models/tokenizer-code.bin"
There is no formal benchmark yet; the figures below are rough estimates meant to give a ballpark of overall performance.
Llama2 7B model on a MacBook Pro M2 Max:
- llama2.c, f32: 4 tok/s
- llama.cpp, Q4_K_M quantization: 24 tok/s
- pecca, f32: 4 tok/s
- pecca, i8 quantization: 11 tok/s
A list of possible future developments for the project:
- Improved tokenizer.
- Inference performance and general memory footprint during inference.
- Experiment with SmoothQuant.
- Explore extending the ndarray dot operation to support cuBLAS or Metal.
- Additional parallelization of independent operations.
- Various refactoring.
- Support for additional models.