Implement a Colab to use Gemma models with llama.cpp #51

Merged
merged 8 commits on Sep 27, 2024
351 changes: 351 additions & 0 deletions Gemma/Using_Gemma_with_LlamaCpp.ipynb
@@ -0,0 +1,351 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Tce3stUlHN0L"
},
"source": [
"##### Copyright 2024 Google LLC."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "tuOe1ymfHZPu"
},
"outputs": [],
"source": [
"# @title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dfsDR_omdNea"
},
"source": [
"# Getting Started with Gemma 2 and LlamaCpp\n",
"\n",
"[Gemma](https://ai.google.dev/gemma) is a family of lightweight, state-of-the-art open-source language models from Google. Built from the same research and technology used to create the Gemini models, Gemma models are text-to-text, decoder-only large language models (LLMs), available in English, with open weights, pre-trained variants, and instruction-tuned variants.\n",
"Gemma models are well-suited for various text-generation tasks, including question-answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop, or your cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.\n",
"\n",
"[llama.cpp](https://github.com/ggerganov/llama.cpp) is a C++ implementation of Meta AI's LLaMA and other large language model architectures, designed for efficient performance on local machines or within environments like Google Colab. It enables you to run large language models without needing extensive computational resources.\n",
"\n",
"To make working with llama.cpp more accessible,\n",
"[llama-cpp-python](https://github.com/abetlen/llama-cpp-python) provides Python bindings for the C++ library. This allows you to enjoy the performance optimizations of `llama.cpp` while benefiting from the simplicity and flexibility of Python. With llama-cpp-python, you get a convenient API for loading models, generating text, and customizing inference parameters.\n",
"\n",
"In this notebook, you will learn how to run Gemma 2 models using `llama.cpp` in a Google Colab environment. You'll install the necessary packages, set up the model, and run a sample prompt.\n",
"\n",
"<table align=\"left\">\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/Using_Gemma_with_LlamaCpp.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
" </td>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FaqZItBdeokU"
},
"source": [
"## Setup\n",
"\n",
"### Select the Colab runtime\n",
"To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:\n",
"\n",
"1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.\n",
"2. Select **Change runtime type**.\n",
"3. Under **Hardware accelerator**, select **T4 GPU**.\n",
"\n",
"### Gemma setup\n",
"\n",
"**Before you dive into the tutorial, let's get you set up with Gemma:**\n",
"\n",
"1. **Hugging Face Account:** If you don't already have one, you can create a free Hugging Face account by clicking [here](https://huggingface.co/join).\n",
"2. **Gemma Model Access:** Head over to the [Gemma model page](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) and accept the usage conditions.\n",
"3. **Colab with Gemma Power:** For this tutorial, you'll need a Colab runtime with enough resources to handle the Gemma 2B model. Choose an appropriate runtime when starting your Colab session.\n",
"4. **Hugging Face Token:** Generate a Hugging Face access (preferably `write` permission) token by clicking [here](https://huggingface.co/settings/tokens). You'll need this token later in the tutorial.\n",
"\n",
"**Once you've completed these steps, you're ready to move on to the next section where you'll set up environment variables in your Colab environment.**\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CY2kGtsyYpHF"
},
"source": [
"### Configure your HF token\n",
"\n",
"Add your Hugging Face token to the Colab Secrets manager to securely store it.\n",
"\n",
"1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. <img src=\"https://storage.googleapis.com/generativeai-downloads/images/secrets.jpg\" alt=\"The Secrets tab is found on the left panel.\" width=50%>\n",
"2. Create a new secret with the name `HF_TOKEN`.\n",
"3. Copy/paste your token key into the Value input box of `HF_TOKEN`.\n",
"4. Toggle the button on the left to allow notebook access to the secret.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7-1PYEuJuJyN"
},
"outputs": [],
"source": [
"import os\n",
"from google.colab import userdata\n",
"\n",
"# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env\n",
"# vars as appropriate for your system.\n",
"os.environ[\"HF_TOKEN\"] = userdata.get(\"HF_TOKEN\")"
]
},
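{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you aren't running in Colab, `userdata.get` isn't available. The next cell is a minimal, optional sketch that falls back to prompting for the token interactively when the Colab Secrets manager can't be used."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal fallback sketch for non-Colab environments: prompt for the token\n",
"# interactively if it isn't already set in the environment.\n",
"import os\n",
"from getpass import getpass\n",
"\n",
"if not os.environ.get(\"HF_TOKEN\"):\n",
"    os.environ[\"HF_TOKEN\"] = getpass(\"Enter your Hugging Face token: \")"
]
},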
{
"cell_type": "markdown",
"metadata": {
"id": "iwjo5_Uucxkw"
},
"source": [
"### Install dependencies\n",
"\n",
"You'll need to install a few Python packages and dependencies to interact with HuggingFace along with `llama-cpp-python` and run the model. Find some of the releases supporting CUDA 12.2 [here](https://abetlen.github.io/llama-cpp-python/whl/cu122/llama-cpp-python/).\n",
"\n",
"Run the following cell to install or upgrade it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5STiMkJQt4gF"
},
"outputs": [],
"source": [
"# The huggingface_hub library allows us to download models and other files from Hugging Face.\n",
"!pip install --upgrade -q huggingface_hub\n",
"\n",
"# The llama-cpp-python library allows us to leverage GPUs\n",
"!pip install llama-cpp-python==0.2.90 \\\n",
" -q -U --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122"
]
},
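{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, confirm that `llama_cpp` imports cleanly and report the installed version before loading any model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify the installation; an import failure here usually means the wheel\n",
"# did not match the runtime's CUDA setup.\n",
"import llama_cpp\n",
"\n",
"print(\"llama-cpp-python version:\", llama_cpp.__version__)"
]
},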
{
"cell_type": "markdown",
"metadata": {
"id": "2_bahJBmwvSp"
},
"source": [
"### Logging into Hugging Face Hub\n",
"\n",
"Next, you'll have to log into the Hugging Face Hub using your access token. This will allow us to download the Gemma model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ztSoQDMnt4ii"
},
"outputs": [],
"source": [
"from huggingface_hub import login\n",
"\n",
"login(os.environ[\"HF_TOKEN\"])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_-NAW6VXOeBt"
},
"source": [
"### Downloading the Gemma 2 Model\n",
"Once you're logged in, you can download the Gemma 2 model files from Hugging Face. The [Gemma 2 model](https://huggingface.co/google/gemma-2-2b-GGUF) is available in **GGUF** format, which is optimized for use with `llama.cpp` and compatible tools like Llamafile."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "D4fq8ha_t4k-"
},
"outputs": [],
"source": [
"from huggingface_hub import hf_hub_download\n",
"\n",
"# Specify the repository and filename\n",
"repo_id = 'google/gemma-2-2b-GGUF' # Repository containing the GGUF model\n",
"filename = '2b_pt_v2.gguf' # The GGUF model file\n",
"\n",
"# Download the model file to the current directory\n",
"hf_hub_download(repo_id=repo_id, filename=filename, local_dir='.')"
]
},
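{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can also list the GGUF files available in the repository, in case you want a different quantization or variant. This small sketch assumes the `repo_id` defined above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List the .gguf files the repository offers, in case you want a different\n",
"# variant than the one downloaded above.\n",
"from huggingface_hub import list_repo_files\n",
"\n",
"gguf_files = [f for f in list_repo_files(repo_id) if f.endswith('.gguf')]\n",
"print(gguf_files)"
]
},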
{
"cell_type": "markdown",
"metadata": {
"id": "M65eFUY1lZwn"
},
"source": [
"### Running the Gemma 2 model using `llama.cpp`\n",
"\n",
"It's possible to run Gemma 2 models directly using the `llama.cpp` binary without relying on Python bindings. This approach allows you to work directly at the C++ level and can be useful for integration into non-Python applications or for performance optimizations.\n",
"\n",
"First, you need to download the `llama.cpp` [binary](https://github.com/ggerganov/llama.cpp/releases/tag/b3496) that supports Gemma 2.\n",
"\n",
"Note: The downloaded binary does not come with GPU support."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "EkGfMH3elstU"
},
"outputs": [],
"source": [
"# Download the prebuilt llama.cpp binary (with Gemma 2 support)\n",
"!wget -O llama.cpp.zip https://github.com/ggerganov/llama.cpp/releases/download/b3496/llama-b3496-bin-ubuntu-x64.zip\n",
"!mkdir llama.cpp && unzip -qq llama.cpp.zip -d llama.cpp"
]
},
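{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you need GPU acceleration at the C++ level, one option is building `llama.cpp` from source instead of using the CPU-only prebuilt binary. The cell below is a rough, optional sketch: it assumes `-DGGML_CUDA=ON` is the CUDA switch for this release, and the build can take a long time on a Colab VM, so the commands are left commented out."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: build llama.cpp (tag b3496) from source with the CUDA backend.\n",
"# Uncomment to run; expect a lengthy compile, and note that the -DGGML_CUDA=ON\n",
"# flag is assumed to be the CUDA switch for this release.\n",
"# !git clone --branch b3496 --depth 1 https://github.com/ggerganov/llama.cpp llama.cpp-src\n",
"# !cmake -S llama.cpp-src -B llama.cpp-src/build -DGGML_CUDA=ON\n",
"# !cmake --build llama.cpp-src/build --config Release -j2 --target llama-cli"
]
},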
{
"cell_type": "markdown",
"metadata": {
"id": "F3jZgxzDl3fn"
},
"source": [
"Now, you can use the `llama.cpp` binary to run the Gemma 2 model with a sample prompt."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "aiCsVkNkl7DN"
},
"outputs": [],
"source": [
"# Define your prompt\n",
"prompt = \"Q: I have 3 apples and I eat 2. How many apples do I have left?\\nA:\"\n",
"model_path = \"2b_pt_v2.gguf\"\n",
"\n",
"# Run the model using llama.cpp\n",
"!./llama.cpp/build/bin/llama-cli -m \"{model_path}\" -p \"{prompt}\" -n 3 \\\n",
" --color --chat-template gemma --temp 0.7"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_WfM0ZHGnO0O"
},
"source": [
"Generating the response can take some time, but soon you'll learn how to use the `llama-cpp-python` Python library to utilize Colab's built-in GPU."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lr3K9wqMH0K-"
},
"source": [
"### Running the Gemma 2 model using `llama-cpp-python`\n",
"\n",
"You will initialize the Llama model using the `llama-cpp-python` library by loading a pre-trained Gemma 2 model from HuggingFace. Here's what each part of the code does:\n",
"\n",
"- `model_path`: Path to the model.\n",
"- `verbose`: Disables verbose logging during model loading for cleaner output.\n",
"- `n_gpu_layers`: Configures GPU acceleration. A value of `-1` means it will use as many GPU layers as possible.\n",
"- `chat_format`: Sets the expected input and output format to match the Gemma model's chat format.\n",
"\n",
"This sets up the environment to use the Gemma 2 language model with `llama.cpp` by loading the model weights from HuggingFace. It prepares the model for generating text in a format compatible with Gemma's specifications."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "gOQhIiXy7rlw"
},
"outputs": [],
"source": [
"from llama_cpp import Llama\n",
"\n",
"llm = Llama(\n",
" model_path=model_path,\n",
" verbose=False,\n",
" n_gpu_layers=-1,\n",
" chat_format=\"gemma\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fR0NezWlMkSU"
},
"source": [
"### Interacting with the Gemma model\n",
"Finally, you prompt the Gemma 2 model and generate the response.\n",
"\n",
"To do this and stream the output as it's generated, set `stream=True` when calling the `llm` function and iterate over the responses using a `for` loop; using `end=''` in the `print` function ensures the text streams without adding new lines after each chunk, and setting `flush=True` forces the print function to update the output immediately, providing smooth streaming."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-eMKXGYV9TQ0"
},
"outputs": [],
"source": [
"prompt = \"What are large language models?\"\n",
"\n",
"# Increase max_tokens if you need longer responses\n",
"output = llm(prompt, max_tokens=256, temperature=0.7, stream=True)\n",
"\n",
"# Iterate over the streaming response and print them\n",
"for response in output:\n",
" print(response['choices'][0]['text'], end='', flush=True)"
]
},
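{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `chat_format=\"gemma\"` was set when loading the model, you can also use the chat-completion API instead of plain completion. The following cell is an optional sketch; output quality with the pre-trained (non-instruction-tuned) checkpoint may vary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: chat-style inference through create_chat_completion,\n",
"# using the Gemma chat format configured above.\n",
"chat_response = llm.create_chat_completion(\n",
"    messages=[{\"role\": \"user\", \"content\": \"Give me one fun fact about llamas.\"}],\n",
"    max_tokens=64,\n",
"    temperature=0.7,\n",
")\n",
"print(chat_response[\"choices\"][0][\"message\"][\"content\"])"
]
},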
{
"cell_type": "markdown",
"metadata": {
"id": "QGSoEOU1QlZ5"
},
"source": [
"Congratulations! You've successfully set up the Gemma 2 model using `llama.cpp` in a Colab environment. You can now experiment with the model, generate text, and explore its capabilities."
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"name": "Using_Gemma_with_LlamaCpp.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Gemma/Using_Gemma_with_Llamafile.ipynb
@@ -49,7 +49,7 @@
"\n",
"<table align=\"left\">\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/Using_Gemma_with_Lllamafile.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
" <a target=\"_blank\" href=\"https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/Using_Gemma_with_Llamafile.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
" </td>\n",
"</table>"
]
@@ -468,7 +468,7 @@
"metadata": {
"accelerator": "GPU",
"colab": {
"name": "Using_Gemma_with_Lllamafile.ipynb",
"name": "Using_Gemma_with_Llamafile.ipynb",
"toc_visible": true
},
"kernelspec": {
2 changes: 2 additions & 0 deletions README.md
@@ -1,3 +1,4 @@

# Welcome to the Gemma Cookbook
This is a collection of guides and examples for [Google Gemma](https://ai.google.dev/gemma/). Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.

@@ -46,6 +47,7 @@ You can find the Gemma models on GitHub, Hugging Face models, Kaggle, Google Clo
| [Advanced_Prompting_Techniques.ipynb](Gemma/Advanced_Prompting_Techniques.ipynb) | Illustrate advanced prompting techniques with Gemma. |
| [Run_with_Ollama.ipynb](Gemma/Run_with_Ollama.ipynb) | Run Gemma models using [Ollama](https://www.ollama.com/). |
| [Using_Gemma_with_Llamafile.ipynb](Gemma/Using_Gemma_with_Llamafile.ipynb) | Run Gemma models using [Llamafile](https://github.com/Mozilla-Ocho/llamafile/). |
| [Using_Gemma_with_LlamaCpp.ipynb](Gemma/Using_Gemma_with_LlamaCpp.ipynb) | Run Gemma models using [LlamaCpp](https://github.com/abetlen/llama-cpp-python/). |
| [Deploy_with_vLLM.ipynb](Gemma/Deploy_with_vLLM.ipynb) | Deploy a Gemma model using [vLLM](https://github.com/vllm-project/vllm). |
| [Deploy_Gemma_in_Vertex_AI.ipynb](Gemma/Deploy_Gemma_in_Vertex_AI.ipynb) | Deploy a Gemma model using [Vertex AI](https://cloud.google.com/vertex-ai). |
| [RAG_with_ChromaDB.ipynb](Gemma/RAG_with_ChromaDB.ipynb) | Build a Retrieval Augmented Generation (RAG) system with Gemma using [ChromaDB](https://www.trychroma.com/) and [Hugging Face](https://huggingface.co/). |
1 change: 0 additions & 1 deletion WISHLIST.md
@@ -2,7 +2,6 @@ A wish list of cookbooks showcasing:

* Inference
* Integration with [Google GenKit](https://firebase.google.com/products/genkit)
* llama.cpp demo
* HF local-gemma demo
* ElasticSearch integration
* Gemma+Gemini with [routerLLM](https://github.com/lm-sys/RouteLLM)