llama.cpp is a library that lets you convert and run LLaMA models using 4-bit integer quantization on a MacBook.
Please skip this step if llama.cpp is already built. For simplicity, only one build option is shown below; check the website for more details.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
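As an optional sanity check (assuming the build succeeded), the freshly built binaries should now sit in the repository root, and main should print its usage text:
ls -l ./main ./quantize
./main --help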
The resulting folder should look like this:
|-- llama.cpp
| |-- convert.py
| |-- gguf-py
| | |-- examples
| | |-- gguf
| | |-- scripts
| | |-- ...
| |-- ...
Please skip this step if the model has already been downloaded. Again, other download options are provided on the website.
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/01-ai/Yi-6B-Chat
If git-lfs is missing, install it with Homebrew:
brew install git-lfs
A typical model folder looks like this:
|-- $MODEL_PATH
| |-- config.json
| |-- generation_config.json
| |-- LICENSE
| |-- main.py
| |-- model-00001-of-00003.safetensors
| |-- model-00002-of-00003.safetensors
| |-- model-00003-of-00003.safetensors
| |-- model.safetensors.index.json
| |-- tokenizer_config.json
| |-- tokenizer.model
| |-- ...
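Note: if git-lfs was installed only after cloning, the .safetensors files may be small text pointer files rather than the actual weights. A quick, optional check:
ls -lh $MODEL_PATH/*.safetensors
# Each shard should be several gigabytes; if the files are only a few hundred bytes,
# run `git lfs pull` inside the model folder to fetch the real weights.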
Make sure all Python dependencies required by llama.cpp are installed:
cd llama.cpp
python3 -m pip install -r requirements.txt
Then, convert the model to the GGUF FP16 format:
python3 convert.py $MODEL_PATH
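If the conversion succeeds, an FP16 GGUF file is written next to the original weights (for a 6B model this is roughly 12 GB; the file name below assumes the default convert.py output path):
ls -lh $MODEL_PATH/ggml-model-f16.gguf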
Lastly, quantize the model to 4 bits (using the q4_0 method):
./quantize $MODEL_PATH/ggml-model-f16.gguf q4_0
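By default the quantized file is written next to the FP16 file as $MODEL_PATH/ggml-model-q4_0.gguf. If you prefer an explicit output path, quantize also accepts one as an optional second argument, for example:
./quantize $MODEL_PATH/ggml-model-f16.gguf $MODEL_PATH/ggml-model-q4_0.gguf q4_0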
It seems the EOS token is converted incorrectly, so one additional step is needed to reset the EOS token ID.
python3 ./gguf-py/scripts/gguf-set-metadata.py $MODEL_PATH/ggml-model-q4_0.gguf tokenizer.ggml.eos_token_id 7
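To confirm the change, you can dump the metadata with the companion script shipped in the same folder (assuming it is present in your llama.cpp checkout); tokenizer.ggml.eos_token_id should now read 7:
python3 ./gguf-py/scripts/gguf-dump.py $MODEL_PATH/ggml-model-q4_0.gguf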
Now run the quantized model in interactive ChatML mode:
./main -m $MODEL_PATH/ggml-model-q4_0.gguf --chatml
Finally, you should be able to type your prompts and interact with the model.
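For reference, common generation options can be appended to the same command; the values below are only illustrative, not tuned recommendations:
# -c: context size, -n: maximum tokens to generate, --temp: sampling temperature
./main -m $MODEL_PATH/ggml-model-q4_0.gguf --chatml -c 2048 -n 256 --temp 0.7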