karpathy/llama2.c for the Internet Computer

Try it out

The 15M parameter model is the backend of ICGPT.

Getting Started

  • Install the C++ development environment for the Internet Computer (docs):

    • Create a python environment. (We like MiniConda, but use whatever you like!)

      conda create --name myllama2 python=3.11
      conda activate myllama2
    • Clone this repo and enter the llama2_c folder

      git clone https://github.com/icppWorld/icpp_llm.git
      cd icpp_llm/llama2_c
    • Install the required python packages (icpp-pro & ic-py):

      pip install -r requirements.txt
    • Install dfx:

      sh -ci "$(curl -fsSL https://internetcomputer.org/install.sh)"
      
      # Configure your shell
      source "$HOME/.local/share/dfx/env"
  • Deploy the 15M parameter pre-trained model to canister llama2_15M:

    • Compile & link to WebAssembly (wasm):

      icpp build-wasm

      Note:

      The first time you run this command, the tool-chain will be installed in ~/.icpp

      This can take a few minutes, depending on your internet speed and computer.

    • Start the local network:

      dfx start --clean
    • Deploy the wasm to a canister on the local network:

      dfx deploy llama2_15M
    • Check the health endpoint of the llama2_15M canister:

      $ dfx canister call llama2_15M health
      (variant { Ok = record { status_code = 200 : nat16 } })
    • Set the canister mode to 'chat-principal'

      $ dfx canister call llama2_15M set_canister_mode chat-principal
      (variant { Ok = record { status_code = 200 : nat16 } })
      
    • Upload the 15M parameter model & tokenizer. (The included model was fine-tuned with a 4096-token tokenizer.)

      python -m scripts.upload --network local --canister llama2_15M --model models/stories15Mtok4096.bin --tokenizer tokenizers/tok4096.bin
    • Check the readiness endpoint, which indicates the model is loaded and the canister can be used for inference:

      $ dfx canister call llama2_15M ready
      (variant { Ok = record { status_code = 200 : nat16 } })
  • Test it with dfx.

    • Generate a new story, 60 tokens at a time, starting with an empty prompt:

      (Your story will be slightly different, because the temperature is > 0.0.)

      $ dfx canister call llama2_15M new_chat '()'
      (variant { Ok = record { status_code = 200 : nat16 } })
      
      $ dfx canister call llama2_15M inference '(record {prompt = "" : text; steps = 60 : nat64; temperature = 0.1 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
      (
        variant {
          Ok = record {
            num_tokens = 60 : nat64;
            inference = "Once upon a time, there was a little girl named Lily. She loved to play outside in the park. One day, she saw a big tree with a swing hanging from it. She ran to the swing and started to swing back and forth. It was so much fun!\nSuddenly,";
          }
        },
      )
      
      $ dfx canister call llama2_15M inference '(record {prompt = "" : text; steps = 60 : nat64; temperature = 0.1 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
      (
        variant {
          Ok = record {
            num_tokens = 60 : nat64;
            inference = " Lily saw a boy who was crying. She asked him what was wrong. The boy said he lost his toy car. Lily felt sad for him and wanted to help. She asked the boy if he wanted to play with her. The boy smiled and said yes.\nLily and the boy played together";
          }
        },
      )
      
      # etc.
      # If you keep going, at some point the LLM will end the story
    • Now generate a new story, starting with your own, non-empty prompt:

      $ dfx canister call llama2_15M new_chat '()'
      (variant { Ok = record { status_code = 200 : nat16 } })
      
      $ dfx canister call llama2_15M inference '(record {prompt = "Timmy climbed in a tree" : text; steps = 60 : nat64; temperature = 0.1 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
      (
        variant {
          Ok = record {
            num_tokens = 5 : nat64;
            inference = "Timmy climbed in a tree";
          }
        },
      )
      
      $ dfx canister call llama2_15M inference '(record {prompt = "" : text; steps = 60 : nat64; temperature = 0.1 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})'
      (
        variant {
          Ok = record {
            num_tokens = 60 : nat64;
            inference = ". He was so excited to see what was on the roof. He looked up and saw a big bird. It was so big and it was so high up. Timmy wanted to get closer to the bird, so he started to climb.\nHe climbed and climbed until he reached the roof.";
          }
        },
      )
      
      # etc.
      # If you keep going, at some point the LLM will end the story (a loop sketch follows below)
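
      Each inference call returns at most the requested number of steps, so a complete story is produced by calling inference in a loop until the model stops. Below is a minimal bash sketch of that loop; the canister name and record fields are copied from the calls above, and treating a response with fewer tokens than requested as the end of the story is an assumption.

      # Sketch: generate one full story, 60 tokens at a time
      dfx canister call llama2_15M new_chat '()'

      while true; do
        result=$(dfx canister call llama2_15M inference '(record {prompt = "" : text; steps = 60 : nat64; temperature = 0.1 : float32; topp = 0.9 : float32; rng_seed = 0 : nat64;})')
        echo "$result"
        # Assumption: fewer tokens than requested means the story has ended
        num_tokens=$(echo "$result" | grep -oE 'num_tokens = [0-9]+' | grep -oE '[0-9]+')
        if [ "${num_tokens:-0}" -lt 60 ]; then
          break
        fi
      done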

Next steps

You will also notice that using dfx to generate stories is not very user friendly. We created a small React frontend, available as an open-source project at https://github.com/icppWorld/icgpt and deployed to the IC as ICGPT.

llama2_260K

For quick tests, we have included a very small model with only 260K parameters, fine-tuned with a 512-token tokenizer.

  • model: stories260K/stories260k.bin
  • tokenizer: stories260K/tok512.bin

The CI/CD, which uses a GitHub Actions workflow, and the demo_pytest.sh script are based on this model.
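
The 260K model is deployed and loaded the same way as the 15M model; only the canister name and file paths change. A sketch, assuming the local network is already running and that the llama2_260K canister used by the demo scripts below is defined in dfx.json:

icpp build-wasm
dfx deploy llama2_260K
dfx canister call llama2_260K set_canister_mode chat-principal
python -m scripts.upload --network local --canister llama2_260K --model stories260K/stories260k.bin --tokenizer stories260K/tok512.bin
dfx canister call llama2_260K ready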

demo_pytest.sh

  • The demo_pytest.sh script starts the local network, deploys llama2_260K, uploads the model & tokenizer, and runs the QA with pytest:

    • ./demo_pytest.sh (on Linux / Mac)

demo shell scripts

  • The demo script starts the local network, deploys llama2, uploads the model & tokenizer, and generates two stories:
    • ./demo.sh

Models

HuggingFace

You can find many models in the llama2.c *.bin format on HuggingFace, for example in karpathy's tinyllamas repository.
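
As an illustration, the stories15M.bin checkpoint from karpathy's tinyllamas repository can be fetched directly (the URL follows HuggingFace's standard resolve path; this checkpoint is not shipped with this repo, and it was trained with the default 32000-token llama2.c tokenizer, so pair it with llama2.c's tokenizer.bin rather than tok4096.bin):

# Example: fetch a llama2.c checkpoint from HuggingFace
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin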

Deploying to the IC main net

  • Deploying to the IC main network works the same way, but you will likely run into a time-out error during upload of the model. You have to patch ic-py as follows:

    #
    # IMPORTANT: ic-py will throw a timeout => patch it here:
    # <your-python-env>/lib/python3.11/site-packages/httpx/_config.py
    # # DEFAULT_TIMEOUT_CONFIG = Timeout(timeout=5.0)
    # DEFAULT_TIMEOUT_CONFIG = Timeout(timeout=99999999.0)
    #
    # And perhaps here:
    # <your-python-env>/lib/python3.11/site-packages/httpcore/_backends/sync.py#L28-L29
    with map_exceptions(exc_map):
        # PATCH
        timeout = 999999999
        # ENDPATCH
        self._sock.settimeout(timeout)
        return self._sock.recv(max_bytes)

    # Now, this command should work
    python -m scripts.upload --network ic --canister llama2_15M --model models/stories15Mtok4096.bin --tokenizer tokenizers/tok4096.bin

Run llama2.c natively

For prompt testing, it is convenient to run llama2.c natively, directly from the llama2.c GitHub repo.

git clone https://github.com/icppWorld/llama2.c
cd llama2.c

conda create --name llama2-c python=3.10
conda activate llama2-c
pip install -r requirements.txt

make run

# Example command
./run models/stories15Mtok4096.bin -z tokenizers/tok4096.bin -t 0.1 -p 0.9 -i "Tony went swimming on the beach"

Fine tuning

When making your own checkpoint via fine-tuning, make sure to train with the correct version of karpathy/llama2.c:

release   commit sha
0.3.0     b9fb86169f56bd787bb644c62a80bbab56f8dccc
0.2.0     57bf0e9ee4bbd61c98c4ad204b72f2b8881ac8cd
0.1.0     b28c1e26c5ab5660267633e1bdc910a43b7255bf
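
For example, to train against the commit matching release 0.3.0 in the table above (a minimal sketch; the release numbers are assumed to refer to icpp_llm releases):

git clone https://github.com/karpathy/llama2.c
cd llama2.c
git checkout b9fb86169f56bd787bb644c62a80bbab56f8dccc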