sysext: add llamaedge recipe #103

Merged: 1 commit, Dec 3, 2024

@hydai hydai commented Nov 28, 2024

Add LlamaEdge sysext

This PR creates a sysext for running LlamaEdge on Flatcar. It will allow users to deploy their own LLM on the cluster.

How to use

Run create_llamaedge_sysext.sh to build the .raw file.

Then use the following Butane config:

variant: flatcar
version: 1.0.0
storage:
  files:
    - path: /opt/extensions/wasmedge-0.14.1-x86-64.raw
      mode: 0420
      contents:
        source: https://github.com/flatcar/sysext-bakery/releases/download/latest/wasmedge-0.14.1-x86-64.raw
    - path: /opt/extensions/llamaedge-0.14.16-x86-64.raw
      mode: 0420
      contents:
        source: https://github.com/flatcar/sysext-bakery/releases/download/latest/llamaedge-0.14.16-x86-64.raw
  links:
    - target: /opt/extensions/llamaedge-0.14.16-x86-64.raw
      path: /etc/extensions/llamaedge.raw
      hard: false
    - target: /opt/extensions/wasmedge-0.14.1-x86-64.raw
      path: /etc/extensions/wasmedge.raw
      hard: false
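
Once an instance has booted with this config, it can be worth confirming that every link under /etc/extensions resolves to a downloaded image. The following is a minimal sketch and not part of this PR: `check_sysext_links` is a hypothetical helper name, and the directory defaults to the /etc/extensions path used in the config above.

```shell
# Hypothetical helper (not part of this PR): report whether every *.raw entry
# in an extensions directory resolves to an existing file.
check_sysext_links() {
  local dir="${1:-/etc/extensions}" status=0 link
  for link in "$dir"/*.raw; do
    # If nothing matches, the glob stays literal; skip it.
    # The -L test keeps dangling symlinks in the loop so they get reported.
    [ -e "$link" ] || [ -L "$link" ] || continue
    if [ -e "$link" ]; then
      echo "ok: $link"
    else
      echo "dangling: $link"
      status=1
    fi
  done
  return "$status"
}

# Example on a provisioned host:
# check_sysext_links /etc/extensions
```

On a running Flatcar host, `systemd-sysext status` additionally shows which extensions are currently merged.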

Testing done

I verified the behavior on a DigitalOcean instance.

Configuration

YAML

variant: flatcar
version: 1.0.0
storage:
  files:
    - path: /opt/extensions/wasmedge-0.14.1-x86-64.raw
      mode: 0420
      contents:
        source: https://github.com/second-state/flatcar-sysext-bakery/releases/download/0.0.3/wasmedge-0.14.1-x86-64.raw
    - path: /opt/extensions/llamaedge-0.14.16-x86-64.raw
      mode: 0420
      contents:
        source: https://github.com/second-state/flatcar-sysext-bakery/releases/download/0.0.3/llamaedge-0.14.16-x86-64.raw
  links:
    - target: /opt/extensions/llamaedge-0.14.16-x86-64.raw
      path: /etc/extensions/llamaedge.raw
      hard: false
    - target: /opt/extensions/wasmedge-0.14.1-x86-64.raw
      path: /etc/extensions/wasmedge.raw
      hard: false

JSON

{
   "ignition":{
      "version":"3.3.0"
   },
   "storage":{
      "files":[
         {
            "path":"/opt/extensions/wasmedge-0.14.1-x86-64.raw",
            "contents":{
               "source":"https://github.com/second-state/flatcar-sysext-bakery/releases/download/0.0.3/wasmedge-0.14.1-x86-64.raw"
            },
            "mode":272
         },
         {
            "path":"/opt/extensions/llamaedge-0.14.16-x86-64.raw",
            "contents":{
               "source":"https://github.com/second-state/flatcar-sysext-bakery/releases/download/0.0.3/llamaedge-0.14.16-x86-64.raw"
            },
            "mode":272
         }
      ],
      "links":[
         {
            "path":"/etc/extensions/llamaedge.raw",
            "hard":false,
            "target":"/opt/extensions/llamaedge-0.14.16-x86-64.raw"
         },
         {
            "path":"/etc/extensions/wasmedge.raw",
            "hard":false,
            "target":"/opt/extensions/wasmedge-0.14.1-x86-64.raw"
         }
      ]
   }
}
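
The YAML and JSON above describe the same files; in the mode field they differ only in notation. Butane accepts the octal literal 0420, while Ignition JSON stores the mode as a decimal integer, which is where 272 comes from. A quick way to check the conversion in a shell is sketched below (the function names are illustrative):

```shell
# Convert between the octal mode written in Butane YAML and the decimal
# integer stored in Ignition JSON. Function names are illustrative.
octal_to_decimal() {
  # A leading 0 makes printf read the argument as an octal constant.
  printf '%d\n' "0${1#0}"
}

decimal_to_octal() {
  printf '0%o\n' "$1"
}

octal_to_decimal 0420   # prints 272
decimal_to_octal 272    # prints 0420
```

For comparison, an octal mode of 0644 would appear as 420 in the JSON.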

Prepare the model

Pick a model that fits your hardware. I chose a small one because of the limited resources of my DigitalOcean instance.

wget https://huggingface.co/second-state/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q2_K.gguf
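
A failed download (for example, an HTML error page saved under the model's file name) only surfaces later, when the server tries to load the file. GGUF files begin with the four ASCII bytes `GGUF`, so a cheap sanity check is possible before starting the server. This is a sketch: `is_gguf` is a hypothetical helper, and it checks only the magic bytes, not the integrity of the whole file.

```shell
# Hypothetical helper: succeed if the file starts with the GGUF magic bytes.
is_gguf() {
  local magic
  [ -f "$1" ] || return 1
  # Read the first four bytes; drop NUL bytes so the shell does not warn.
  magic="$(head -c 4 "$1" | tr -d '\0')"
  [ "$magic" = "GGUF" ]
}

# Example:
# is_gguf Llama-3.2-1B-Instruct-Q2_K.gguf && echo "looks like a GGUF file"
```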

Start the server

The server WASM module is shipped inside the sysext image at /usr/lib/wasmedge/wasm/llama-api-server.wasm.

On an instance with little memory, you can also reduce CONTEXT_SIZE.

MODEL_FILE="Llama-3.2-1B-Instruct-Q2_K.gguf"
API_SERVER_WASM="/usr/lib/wasmedge/wasm/llama-api-server.wasm"
PROMPT_TEMPLATE="llama-3-chat"
CONTEXT_SIZE=128
MODEL_NAME="llama-3.2-1B"

wasmedge \
  --dir .:. \
  --nn-preload "default:GGML:AUTO:${MODEL_FILE}" \
  "${API_SERVER_WASM}" \
  --prompt-template "${PROMPT_TEMPLATE}" \
  --ctx-size "${CONTEXT_SIZE}" \
  --model-name "${MODEL_NAME}"

This loads the model into memory and starts the OpenAI-compatible API server.

The output should look like this:

..omitted..
[2024-11-28 09:09:08.909] [info] [WASI-NN] GGML backend: llama_system_info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
[2024-11-28 09:09:08.920] [info] llama_core in crates/llama-core/src/lib.rs:128: running mode: chat
[2024-11-28 09:09:08.923] [info] llama_core in crates/llama-core/src/lib.rs:140: The core context has been initialized
[2024-11-28 09:09:08.923] [info] llama_core in crates/llama-core/src/lib.rs:230: Getting the plugin info
[2024-11-28 09:09:08.923] [info] llama_core in crates/llama-core/src/lib.rs:418: Get the running mode.
[2024-11-28 09:09:08.923] [info] llama_core in crates/llama-core/src/lib.rs:443: running mode: chat
[2024-11-28 09:09:08.923] [info] llama_core in crates/llama-core/src/lib.rs:312: Getting the plugin info by the graph named llama-3.2-1B
[2024-11-28 09:09:08.923] [info] llama_core::utils in crates/llama-core/src/utils.rs:175: Get the output buffer generated by the model named llama-3.2-1B
[2024-11-28 09:09:08.924] [info] llama_core::utils in crates/llama-core/src/utils.rs:193: Output buffer size: 95
[2024-11-28 09:09:08.924] [info] llama_core in crates/llama-core/src/lib.rs:372: Plugin info: b4067(commit 54ef9cfc)
[2024-11-28 09:09:08.924] [info] llama_api_server in llama-api-server/src/main.rs:459: plugin_ggml_version: b4067 (commit 54ef9cfc)
[2024-11-28 09:09:08.930] [info] llama_api_server in llama-api-server/src/main.rs:504: Listening on 0.0.0.0:8080
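
Loading the model takes some time, so a script that starts the server and then calls the API should wait until the server is listening. One way to sketch this is to poll an endpoint until it answers. `wait_for` is a hypothetical helper, and the attempt count and delay in the example are arbitrary choices, not values from this PR.

```shell
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds, at most ATTEMPTS times, sleeping
# DELAY_SECONDS between attempts. Returns 0 on success, 1 otherwise.
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # Skip the sleep once the last attempt has failed.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example: poll the model list endpoint for up to about 30 seconds.
# wait_for 30 1 curl -fsS http://localhost:8080/v1/models
```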

Interact with the API server

See the LlamaEdge documentation for details on more options: https://github.com/LlamaEdge/LlamaEdge/tree/main/llama-api-server

Get model list

curl -X GET http://localhost:8080/v1/models -H 'accept:application/json'

Expected output:

{
   "object":"list",
   "data":[
      {
         "id":"llama-3.2-1B",
         "created":1732784948,
         "object":"model",
         "owned_by":"Not specified"
      }
   ]
}
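
In a script, the model ids are usually the only part of this response that matters, for example to confirm that the name passed via --model-name is being served. The sketch below pulls them out with plain text matching; `model_ids` is a hypothetical helper and not a JSON parser, so a tool such as jq (`jq -r '.data[].id'`) is the more robust choice where it is installed.

```shell
# Hypothetical helper: print the "id" of each model in a /v1/models response
# read from stdin. Plain text matching only; this is not a JSON parser.
model_ids() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' | cut -d'"' -f4
}

# Example against a running server:
# curl -s http://localhost:8080/v1/models -H 'accept:application/json' | model_ids
```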

Chat completion

curl -X POST http://localhost:8080/v1/chat/completions \
    -H 'accept:application/json' \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"system", "content": "You are a helpful assistant. Reply in short sentence"}, {"role":"user", "content": "What is the capital of Japan?"}], "model":"llama-3.2-1B"}'

Expected output:

{
   "id":"chatcmpl-cdf8f57f-70ec-4cb3-b1f3-e60054f64981",
   "object":"chat.completion",
   "created":1732785197,
   "model":"llama-3.2-1B",
   "choices":[
      {
         "index":0,
         "message":{
            "content":"The capital of Japan is Tokyo.",
            "role":"assistant"
         },
         "finish_reason":"stop",
         "logprobs":null
      }
   ],
   "usage":{
      "prompt_tokens":33,
      "completion_tokens":9,
      "total_tokens":42
   }
}

@tormath1 tormath1 left a comment

Thanks for this contribution, it's exciting to see this running on Flatcar :)

@tormath1 tormath1 merged commit dd38a27 into flatcar:main Dec 3, 2024
@hydai hydai deleted the add_llamaedge branch December 3, 2024 14:56