There are many Phi-3 models to choose from: Phi-3 mini, Phi-3 small, Phi-3 medium, and Phi-3 vision. With the Phi-3 models, there are also short (4K/8K) context versions and long (128K) context versions to choose from. The long context versions can accept much longer prompts and produce longer output text, but they consume more memory.
The Phi-3 ONNX models are hosted in a collection on Hugging Face.
This tutorial downloads and runs the Phi-3 mini short context model. If you would like to use another model, please change the model name in the instructions below.
- Install the git large file system extension

  Hugging Face uses `git` for version control. To download the ONNX models, you need `git lfs` to be installed, if you do not already have it.

  - Windows: `winget install -e --id GitHub.GitLFS` (if you don't have winget, download and run the `exe` from the official source)
  - Linux: `apt-get install git-lfs`
  - MacOS: `brew install git-lfs`

  Then run `git lfs install`
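  To confirm the extension is active (an optional sanity check, not part of the original steps), you can run:

  ```bash
  # Prints the installed Git LFS version if the extension is available
  git lfs version
  ```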
- Install the HuggingFace CLI

  ```bash
  pip install huggingface-hub[cli]
  ```
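  To confirm the CLI is on your path (again optional), you can run:

  ```bash
  # Should print the CLI's usage and available subcommands
  huggingface-cli --help
  ```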
Are you on a Windows machine with a GPU?

- I don't know → Review this guide to see whether you have a GPU in your Windows machine.
- Yes → Follow the instructions for DirectML.
- No → Do you have an NVIDIA GPU?
  - I don't know → Review this guide to see whether you have a CUDA-capable GPU.
  - Yes → Follow the instructions for NVIDIA CUDA GPU.
  - No → Follow the instructions for CPU.
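If you're unsure what hardware you have, two quick checks can help (assuming the relevant drivers are installed; these commands are a convenience, not part of the original tutorial):

```bash
# Windows: open the DirectX Diagnostic Tool and look for your GPU on the Display tab
dxdiag

# Any machine with the NVIDIA driver installed: list NVIDIA GPUs
nvidia-smi
```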
Note: only one package and one model is required; pick them based on your hardware. That is, execute the steps in only one of the following sections.
### Run with DirectML

- Download the model

  ```bash
  huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include directml/* --local-dir .
  ```

  This command downloads the model into a folder called `directml`.

- Install the generate() API

  ```bash
  pip install onnxruntime-genai-directml
  ```

  You should now see `onnxruntime-genai-directml` in your `pip list`.

- Run the model

  Run the model with phi3-qa.py.

  ```bash
  curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
  python phi3-qa.py -m directml\directml-int4-awq-block-128
  ```

  Once the script has loaded the model, it will ask you for input in a loop, streaming the output as it is produced by the model. For example:

  ```
  Input: Tell me a joke about GPUs

  Certainly! Here's a light-hearted joke about GPUs:

  Why did the GPU go to school? Because it wanted to improve its "processing power"!

  This joke plays on the double meaning of "processing power," referring both to the computational abilities of a GPU and the idea of a student wanting to improve their academic skills.
  ```
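If you want to go beyond the question-and-answer script, you can call the generate() API directly from Python. The following is a minimal sketch of the streaming generation loop that phi3-qa.py runs, assuming the DirectML model folder downloaded above (swap in the CUDA or CPU path as needed) and the onnxruntime-genai Python API at the time of writing; method names such as `compute_logits` and the `input_ids` property may differ in newer releases, and `max_length=256` is an arbitrary choice:

```python
import onnxruntime_genai as og

# Load the model folder downloaded above
model = og.Model("directml/directml-int4-awq-block-128")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Wrap the question in the Phi-3 chat template before tokenizing,
# as the example script does
prompt = "<|user|>\nTell me a joke about GPUs <|end|>\n<|assistant|>"
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = input_tokens

# Stream each token to the console as it is produced
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
print()
```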
### Run with NVIDIA CUDA

- Download the model

  ```bash
  huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cuda/cuda-int4-rtn-block-32/* --local-dir .
  ```

  This command downloads the model into a folder called `cuda`.

- Install the generate() API

  ```bash
  pip install onnxruntime-genai-cuda
  ```
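  Whichever hardware package you install, the Python module you import is named `onnxruntime_genai`. A quick sanity check (optional, not part of the original steps):

  ```bash
  # Fails with an ImportError if the package did not install correctly
  python -c "import onnxruntime_genai"
  ```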
- Run the model

  Run the model with phi3-qa.py.

  ```bash
  curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
  python phi3-qa.py -m cuda/cuda-int4-rtn-block-32
  ```

  Once the script has loaded the model, it will ask you for input in a loop, streaming the output as it is produced by the model. For example:

  ```
  Input: Tell me a joke about creative writing

  Output: Why don't writers ever get lost? Because they always follow the plot!
  ```
### Run on CPU

- Download the model

  ```bash
  huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
  ```

  This command downloads the model into a folder called `cpu_and_mobile`.

- Install the generate() API for CPU

  ```bash
  pip install onnxruntime-genai
  ```

- Run the model

  Run the model with phi3-qa.py.

  ```bash
  curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py -o phi3-qa.py
  python phi3-qa.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4
  ```

  Once the script has loaded the model, it will ask you for input in a loop, streaming the output as it is produced by the model. For example:

  ```
  Input: Tell me a joke about generative AI

  Output: Why did the generative AI go to school? To improve its "creativity" algorithm!

  This joke plays on the double meaning of "creativity" in the context of AI. Generative AI is often associated with its ability to produce creative content, but in this joke, it's humorously suggested that the AI is going to school to enhance its creative skills, as if it were a human student.
  ```