karpathy/llama2.c for the Internet Computer

Try it out

The 15M parameter model is the backend of ICGPT.

Getting Started

  • Install the C++ development environment for the Internet Computer (docs):

    • Create a python environment. (We like MiniConda, but use whatever you like!)

      conda create --name myllama2 python=3.11
      conda activate myllama2
    • Clone this repo and enter the llama2_c folder

      git clone https://github.com/icppWorld/icpp_llm.git
      cd icpp_llm/llama2_c
    • Install the required python packages (icpp-pro & ic-py):

      pip install -r requirements.txt
    • Install dfx:

      sh -ci "$(curl -fsSL https://internetcomputer.org/install.sh)"
      
      # Configure your shell
      source "$HOME/.local/share/dfx/env"
  • Deploy the 15M parameter pre-trained model to canister llama2_15M:

    • Compile & link to WebAssembly (wasm):

      icpp build-wasm

      Note:

      The first time you run this command, the tool-chain will be installed in ~/.icpp

      This can take a few minutes, depending on your internet speed and computer.

    • Start the local network:

      dfx start --clean
    • Deploy the wasm to a canister on the local network:

      dfx deploy llama2_15M
    • Check the health endpoint of the llama2_15M canister:

      $ dfx canister call llama2_15M health
      (variant { Ok = record { status_code = 200 : nat16 } })
    • Set the canister mode to 'chat-principal'

      $ dfx canister call llama2_15M set_canister_mode chat-principal
      (variant { Ok = record { status_code = 200 : nat16 } })
      
    • Upload the 15M parameter model & tokenizer. (The included model was fine-tuned with a 4096-token tokenizer.)

      python -m scripts.upload --network local --canister llama2_15M --model models/stories15Mtok4096.bin --tokenizer tokenizers/tok4096.bin
    • Check the readiness endpoint, which indicates the model is loaded and the canister can be used for inference:

      $ dfx canister call llama2_15M ready
      (variant { Ok = record { status_code = 200 : nat16 } })
  • Test it with dfx.

    • Generate a new story, 60 tokens at a time, starting with an empty prompt:

      (Your story will be slightly different, because the temperature is > 0.0.)

      $ dfx canister call llama2_15M new_chat '()'
      (variant { Ok = record { status_code = 200 : nat16 } })
      
      $ dfx canister call llama2_15M inference '(record {prompt = "" : text; steps = 60 : nat64; temperature = 0.1 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
      (
        variant {
          Ok = record {
            num_tokens = 60 : nat64;
            inference = "Once upon a time, there was a little girl named Lily. She loved to play outside in the park. One day, she saw a big tree with a swing hanging from it. She ran to the swing and started to swing back and forth. It was so much fun!\nSuddenly,";
          }
        },
      )
      
      $ dfx canister call llama2_15M inference '(record {prompt = "" : text; steps = 60 : nat64; temperature = 0.1 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
      (
        variant {
          Ok = record {
            num_tokens = 60 : nat64;
            inference = " Lily saw a boy who was crying. She asked him what was wrong. The boy said he lost his toy car. Lily felt sad for him and wanted to help. She asked the boy if he wanted to play with her. The boy smiled and said yes.\nLily and the boy played together";
          }
        },
      )
      
      # etc.
      # If you keep going, at some point the LLM will end the story
    • Now generate a new story, starting with your own, non-empty prompt:

      $ dfx canister call llama2_15M new_chat '()'
      (variant { Ok = record { status_code = 200 : nat16 } })
      
      $ dfx canister call llama2_15M inference '(record {prompt = "Timmy climbed in a tree" : text; steps = 60 : nat64; temperature = 0.1 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
      (
        variant {
          Ok = record {
            num_tokens = 5 : nat64;
            inference = "Timmy climbed in a tree";
          }
        },
      )
      
      $ dfx canister call llama2_15M inference '(record {prompt = "" : text; steps = 60 : nat64; temperature = 0.1 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
      (
        variant {
          Ok = record {
            num_tokens = 60 : nat64;
            inference = ". He was so excited to see what was on the roof. He looked up and saw a big bird. It was so big and it was so high up. Timmy wanted to get closer to the bird, so he started to climb.\nHe climbed and climbed until he reached the roof.";
          }
        },
      )
      
      # etc.
      # If you keep going, at some point the LLM will end the story (a loop sketch follows below)
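
      Each inference call returns at most the requested number of steps, so a complete story is produced by calling inference in a loop until the model stops. Below is a minimal bash sketch of that loop; the canister name and record fields are copied from the calls above, and treating a response with fewer tokens than requested as the end of the story is an assumption.

      # Sketch: generate one full story, 60 tokens at a time
      dfx canister call llama2_15M new_chat '()'

      while true; do
        result=$(dfx canister call llama2_15M inference '(record {prompt = "" : text; steps = 60 : nat64; temperature = 0.1 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})')
        echo "$result"
        # Assumption: fewer tokens than requested means the story has ended
        num_tokens=$(echo "$result" | grep -oE 'num_tokens = [0-9]+' | grep -oE '[0-9]+')
        if [ "${num_tokens:-0}" -lt 60 ]; then
          break
        fi
      done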

Next steps

You will also notice that using dfx to generate stories is not very user friendly. We created a small React frontend, available as an open-source project at https://github.com/icppWorld/icgpt and deployed to the IC as ICGPT.

llama2_260K

For quick tests, we have included a very small model with only 260K parameters, fine-tuned with a 512-token tokenizer.

  • model: stories260K/stories260k.bin
  • tokenizer: stories260K/tok512.bin

The CI/CD, which uses a GitHub Actions workflow, and the demo_pytest.sh script are based on this model.
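
The 260K model is deployed and loaded the same way as the 15M model; only the canister name and file paths change. A sketch, assuming the local network is already running and that the llama2_260K canister used by the demo scripts below is defined in dfx.json:

icpp build-wasm
dfx deploy llama2_260K
dfx canister call llama2_260K set_canister_mode chat-principal
python -m scripts.upload --network local --canister llama2_260K --model stories260K/stories260k.bin --tokenizer stories260K/tok512.bin
dfx canister call llama2_260K ready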

demo_pytest.sh

  • The demo_pytest.sh script starts the local network, deploys llama2_260K, uploads the model & tokenizer, and runs the QA with pytest:

    • ./demo_pytest.sh (on Linux / Mac)

demo shell scripts

  • The demo script starts the local network, deploys llama2, uploads the model & tokenizer, and generates two stories:
    • ./demo.sh

Models

HuggingFace

You can find many models in the llama2.c *.bin format on HuggingFace, for example in karpathy's tinyllamas repository.
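
As an illustration, the stories15M.bin checkpoint from karpathy's tinyllamas repository can be fetched directly (the URL follows HuggingFace's standard resolve path; this checkpoint is not shipped with this repo, and it was trained with the default 32000-token llama2.c tokenizer, so pair it with llama2.c's tokenizer.bin rather than tok4096.bin):

# Example: fetch a llama2.c checkpoint from HuggingFace
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin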

Deploying to the IC main net

  • Deploying to the IC main network works the same way, but you will likely run into a time-out error during upload of the model. You have to patch ic-py as follows:

    #
    # IMPORTANT: ic-py will throw a timeout => patch it here:
    # <your-python-env>/lib/python3.11/site-packages/httpx/_config.py
    # # DEFAULT_TIMEOUT_CONFIG = Timeout(timeout=5.0)
    # DEFAULT_TIMEOUT_CONFIG = Timeout(timeout=99999999.0)
    #
    # And perhaps here:
    # <your-python-env>/lib/python3.11/site-packages/httpcore/_backends/sync.py#L28-L29
    with map_exceptions(exc_map):
        # PATCH
        timeout = 999999999
        # ENDPATCH
        self._sock.settimeout(timeout)
        return self._sock.recv(max_bytes)

    # Now, this command should work
    python -m scripts.upload --network ic --canister llama2_15M --model models/stories15Mtok4096.bin --tokenizer tokenizers/tok4096.bin

Run llama2.c natively

For prompt testing, it is convenient to run llama2.c natively, directly from the llama2.c GitHub repo.

git clone https://github.com/icppWorld/llama2.c
cd llama2.c

conda create --name llama2-c python=3.10
conda activate llama2-c
pip install -r requirements.txt

make run

# Example command
./run models/stories15Mtok4096.bin -z tokenizers/tok4096.bin -t 0.1 -p 0.9 -i "Tony went swimming on the beach"

Fine tuning

When making your own checkpoint via fine-tuning, make sure to train with the correct version of karpathy/llama2.c:

release   commit sha
0.3.0     b9fb86169f56bd787bb644c62a80bbab56f8dccc
0.2.0     57bf0e9ee4bbd61c98c4ad204b72f2b8881ac8cd
0.1.0     b28c1e26c5ab5660267633e1bdc910a43b7255bf
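
For example, to train against the commit matching release 0.3.0 in the table above (a minimal sketch; the release numbers are assumed to refer to icpp_llm releases):

git clone https://github.com/karpathy/llama2.c
cd llama2.c
git checkout b9fb86169f56bd787bb644c62a80bbab56f8dccc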