This repository provides scripts, popular third-party benchmarks, and instructions for evaluating the accuracy of Large Language Models (LLMs). It demonstrates how to use a ModelOpt quantized LLM with various established benchmarks, including deployment options using DirectML and TensorRT-LLM in a Windows environment.
| Category | Details |
|---|---|
| Operating System | Windows 10 or later |
| Python | - For ORT-DML GenAI, use Python 3.11.<br>- For TensorRT-LLM, use Python 3.10.<br>- All other backends are compatible with both Python 3.10 and 3.11. |
| Package Manager | pip |
| Compatible Hardware and Drivers | Ensure the necessary hardware (e.g., a CUDA-compatible GPU) and drivers are installed for your chosen evaluation method:<br>- DirectML for DirectML-based evaluation<br>- CUDA for TensorRT |
| Additional Tools | - `cmd`: recommended for running the provided commands.<br>- Tar utility: included in Windows 10 and later via PowerShell.<br>- Curl: included in Windows 10 and later via PowerShell. |
The MMLU benchmark assesses LLM performance across a wide range of tasks, producing a score between 0 and 1, where a higher score indicates better accuracy. Refer to the MMLU paper for more details.
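The score itself is plain multiple-choice accuracy. A minimal sketch (the predicted and gold answer letters are hypothetical placeholders, and this helper is not code from the benchmark script):

```python
# Minimal sketch of how an MMLU score is computed: the fraction of
# multiple-choice questions answered correctly, so it falls in [0, 1].
# The predicted/gold letters below are hypothetical placeholders.

def mmlu_score(predictions, answers):
    """Return accuracy in [0, 1] given predicted and gold choice letters."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

print(mmlu_score(["A", "C", "B", "D"], ["A", "C", "D", "D"]))  # 0.75
```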
The table below lists the setup steps to prepare your environment for evaluating LLMs using the MMLU benchmark.
| Step | Command or Description |
|---|---|
| Open PowerShell as Administrator | - |
| Create and Activate a Virtual Environment (Optional but Recommended) | `python -m venv llm_env`<br>`.\llm_env\Scripts\Activate.ps1` |
| Install PyTorch and Related Packages | `pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124` |
| Install ONNX Runtime Packages | `pip install onnxruntime-directml==1.20`<br>`pip install onnxruntime-genai-directml==0.4.0` |
| Install Benchmark Requirements | `pip install -r requirements.txt` |
| Download MMLU Data | `mkdir data`<br>`curl -o .\data\mmlu.tar https://people.eecs.berkeley.edu/~hendrycks/data.tar`<br>`tar -xf .\data\mmlu.tar -C .\data`<br>`Move-Item .\data\data .\data\mmlu` |
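After extraction, the archive should leave per-subject CSV files under `data\mmlu`. An optional sanity check (an illustrative helper, assuming the `dev`/`test` subdirectory layout of the published Hendrycks archive):

```python
# Optional sanity check that the MMLU archive extracted as expected.
# Assumes the published layout: data/mmlu containing dev/ and test/
# directories of per-subject CSVs (e.g. abstract_algebra_test.csv).
from pathlib import Path

def check_mmlu_layout(root="data/mmlu"):
    """Return the number of per-subject test CSVs, or raise if the layout is wrong."""
    root = Path(root)
    missing = [d for d in ("dev", "test") if not (root / d).is_dir()]
    if missing:
        raise FileNotFoundError(f"missing MMLU subdirectories under {root}: {missing}")
    return len(list((root / "test").glob("*_test.csv")))
```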
Once the MMLU benchmark is set up, you can use the `mmlu_benchmark.py` script to evaluate LLMs deployed with various backends, as shown in the examples below.
### MMLU Benchmark with GenAI APIs for ORT-DML Deployment

To run the model with ORT-DML using GenAI, use the `--ep genai_dml` argument.
- Test Suite

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <ONNX_model_folder> `
      --ep genai_dml `
      --output_file <output_log_file.json> `
      --ntrain 5
  ```
- Specific Subjects

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <ONNX_model_folder> `
      --ep genai_dml `
      --output_file <output_log_file.json> `
      --subject abstract_algebra,anatomy,college_mathematics `
      --ntrain 5
  ```
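The `--subject` value is a comma-separated list of subject names. An illustrative way to split such a value (a hypothetical helper, not the script's actual implementation):

```python
# Illustrative parsing of a comma-separated --subject value. This is a
# hypothetical helper, not code from mmlu_benchmark.py itself.
def parse_subjects(arg):
    """Split a comma-separated subject list, dropping empty entries."""
    return [s.strip() for s in arg.split(",") if s.strip()]

print(parse_subjects("abstract_algebra,anatomy,college_mathematics"))
# ['abstract_algebra', 'anatomy', 'college_mathematics']
```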
### MMLU Benchmark with ONNX Runtime APIs for DML, CUDA, or CPU Deployment

To run the model with the ORT-DML, ORT-CUDA, or ORT-CPU execution providers, use `--ep ort_dml`, `--ep ort_cuda`, or `--ep ort_cpu`, respectively.
- Test Suite

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <ONNX_model_folder> `
      --ep ort_dml `
      --output_file <output_log_file.json> `
      --ntrain 5
  ```
- Specific Subjects

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <ONNX_model_folder> `
      --ep ort_dml `
      --output_file <output_log_file.json> `
      --subject abstract_algebra,anatomy,college_mathematics `
      --ntrain 5
  ```
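Under the hood, each of these `--ep` choices corresponds to an ONNX Runtime execution-provider list. The provider identifiers below are ONNX Runtime's real names; the helper itself is a hypothetical sketch, not code from the benchmark script:

```python
# Hypothetical mapping from the --ep flag to ONNX Runtime execution providers.
# The provider identifiers are ONNX Runtime's real names; this helper is
# illustrative, not taken from mmlu_benchmark.py.
EP_PROVIDERS = {
    "ort_dml": ["DmlExecutionProvider", "CPUExecutionProvider"],
    "ort_cuda": ["CUDAExecutionProvider", "CPUExecutionProvider"],
    "ort_cpu": ["CPUExecutionProvider"],
}

def providers_for(ep):
    """Return the execution-provider list for an --ep value."""
    if ep not in EP_PROVIDERS:
        raise ValueError(f"unsupported --ep value: {ep}")
    return EP_PROVIDERS[ep]

print(providers_for("ort_dml"))  # ['DmlExecutionProvider', 'CPUExecutionProvider']
```

Listing `CPUExecutionProvider` last lets ONNX Runtime fall back to CPU for any operator the accelerated provider cannot run.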
### MMLU Benchmark with Transformer APIs for PyTorch Hugging Face Models

To evaluate a PyTorch Hugging Face (HF) model, use the `--ep pt` argument.
- Test Suite

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <HF_model_folder> `
      --ep pt `
      --output_file <output_log_file.json> `
      --ntrain 5 `
      --dtype <torch_dtype in model's config.json {float16|bfloat16}>
  ```
- Specific Subjects

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <HF_model_folder> `
      --ep pt `
      --output_file <output_log_file.json> `
      --subject abstract_algebra,anatomy,college_mathematics `
      --ntrain 5 `
      --dtype <torch_dtype in model's config.json {float16|bfloat16}>
  ```
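The `--dtype` value should match the `torch_dtype` recorded in the checkpoint's `config.json`. A small illustrative helper for looking it up (assumes the standard HF config layout; this helper is not part of the benchmark script):

```python
# Illustrative lookup of torch_dtype from a Hugging Face model folder's
# config.json, so the value passed to --dtype matches the checkpoint.
# Assumes the standard HF layout; this helper is not part of the script.
import json
from pathlib import Path

def read_torch_dtype(model_dir):
    """Return the torch_dtype string ('float16' or 'bfloat16') from config.json."""
    with open(Path(model_dir) / "config.json") as f:
        dtype = json.load(f).get("torch_dtype")
    if dtype not in ("float16", "bfloat16"):
        raise ValueError(f"unexpected torch_dtype: {dtype!r}")
    return dtype
```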
### MMLU Benchmark with TensorRT-LLM APIs for TensorRT-LLM Deployment
- Install TensorRT-LLM and Compatible PyTorch

  ```powershell
  pip install torch==2.4.0+cu121 --index-url https://download.pytorch.org/whl
  pip install tensorrt_llm==0.12.0 `
      --extra-index-url https://pypi.nvidia.com `
      --extra-index-url https://download.pytorch.org/whl/cu121/torch/
  ```
- Run the Benchmark

  - Test Suite

    ```powershell
    python mmlu_benchmark.py `
        --model_name causal `
        --hf_model_dir <hf_model_path> `
        --engine_dir <engine_path> `
        --ep trt-llm `
        --ntrain 5 `
        --output_file result.json
    ```
  - Specific Subjects

    ```powershell
    python mmlu_benchmark.py `
        --model_name causal `
        --hf_model_dir <hf_model_path> `
        --engine_dir <engine_path> `
        --ep trt-llm `
        --ntrain 5 `
        --output_file result.json `
        --subject abstract_algebra,anatomy,college_mathematics
    ```