llama.cpp is a library that lets you convert and run LLaMA models using 4-bit integer quantization on a MacBook.
Please skip this step if llama.cpp is already built. For simplicity, only one build option is shown below; check the website for more details.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
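As an optional sanity check (assuming the build succeeded), the freshly built binaries should now sit in the repository root, and main should print its usage text:
ls -l ./main ./quantize
./main --help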
The resulting folder should look like this:
|-- llama.cpp
| |-- convert.py
| |-- gguf-py
| | |-- examples
| | |-- gguf
| | |-- scripts
| | |-- ...
| |-- ...
Please skip this step if the model has already been downloaded. Again, other download options are provided on the website.
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/01-ai/Yi-6B-Chat
If git-lfs is missing, install it with Homebrew:
brew install git-lfs
A typical model folder looks like this:
|-- $MODEL_PATH
| |-- config.json
| |-- generation_config.json
| |-- LICENSE
| |-- main.py
| |-- model-00001-of-00003.safetensors
| |-- model-00002-of-00003.safetensors
| |-- model-00003-of-00003.safetensors
| |-- model.safetensors.index.json
| |-- tokenizer_config.json
| |-- tokenizer.model
| |-- ...
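Note: if git-lfs was installed only after cloning, the .safetensors files may be small text pointer files rather than the actual weights. A quick, optional check:
ls -lh $MODEL_PATH/*.safetensors
# Each shard should be several gigabytes; if the files are only a few hundred bytes,
# run `git lfs pull` inside the model folder to fetch the real weights.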
Make sure all Python dependencies required by llama.cpp are installed:
cd llama.cpp
python3 -m pip install -r requirements.txt
Then, convert the model to the GGUF FP16 format:
python3 convert.py $MODEL_PATH
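If the conversion succeeds, an FP16 GGUF file is written next to the original weights (for a 6B model this is roughly 12 GB; the file name below assumes the default convert.py output path):
ls -lh $MODEL_PATH/ggml-model-f16.gguf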
Lastly, quantize the model to 4 bits (using the q4_0 method):
./quantize $MODEL_PATH/ggml-model-f16.gguf q4_0
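By default the quantized file is written next to the FP16 file as $MODEL_PATH/ggml-model-q4_0.gguf. If you prefer an explicit output path, quantize also accepts one as an optional second argument, for example:
./quantize $MODEL_PATH/ggml-model-f16.gguf $MODEL_PATH/ggml-model-q4_0.gguf q4_0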
It seems the EOS token is converted incorrectly, so one additional step is needed to reset the EOS token ID.
python3 ./gguf-py/scripts/gguf-set-metadata.py $MODEL_PATH/ggml-model-q4_0.gguf tokenizer.ggml.eos_token_id 7
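To confirm the change, you can dump the metadata with the companion script shipped in the same folder (assuming it is present in your llama.cpp checkout); tokenizer.ggml.eos_token_id should now read 7:
python3 ./gguf-py/scripts/gguf-dump.py $MODEL_PATH/ggml-model-q4_0.gguf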
Now run the quantized model in interactive ChatML mode:
./main -m $MODEL_PATH/ggml-model-q4_0.gguf --chatml
Finally, you should be able to type your prompts and interact with the model.
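For reference, common generation options can be appended to the same command; the values below are only illustrative, not tuned recommendations:
# -c: context size, -n: maximum tokens to generate, --temp: sampling temperature
./main -m $MODEL_PATH/ggml-model-q4_0.gguf --chatml -c 2048 -n 256 --temp 0.7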