This is the code repository for the paper Calibrating Large Language Models Using Their Generations Only by Dennis Ulmer, Martin Gubri, Hwaran Lee, Sangdoo Yun and Seong Joon Oh.
Developed at Parameter Lab with the support of Naver AI Lab.
The 🍑 fine-tuned models are now available on the Hugging Face hub 🤗.
Here’s how you can use them:
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModel.from_pretrained("parameterlab/apricot_clustering_trivia_qa_deberta-v3-base_for_vicuna-7b-v1.5")
To select a different model, adjust the model name using the following pattern: apricot_{model_type}_{dataset_name}_deberta-v3-base_for_{model_name}
where:
{model_type} can be either clustering or binary
{dataset_name} can be either trivia_qa or coqa
{model_name} can be either vicuna-7b-v1.5 or gpt-3.5-turbo-0125
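The naming pattern can also be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper name apricot_model_id is hypothetical, and the parameterlab/ prefix is taken from the loading example above. It is assumed, but not verified here, that every combination of the listed options exists on the hub.

```python
from itertools import product

# Options as listed in this README.
MODEL_TYPES = ("clustering", "binary")
DATASET_NAMES = ("trivia_qa", "coqa")
MODEL_NAMES = ("vicuna-7b-v1.5", "gpt-3.5-turbo-0125")


def apricot_model_id(model_type, dataset_name, model_name, organization="parameterlab"):
    """Build a hub identifier following the naming pattern (hypothetical helper)."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"model_type must be one of {MODEL_TYPES}")
    if dataset_name not in DATASET_NAMES:
        raise ValueError(f"dataset_name must be one of {DATASET_NAMES}")
    if model_name not in MODEL_NAMES:
        raise ValueError(f"model_name must be one of {MODEL_NAMES}")
    return (
        f"{organization}/apricot_{model_type}_{dataset_name}"
        f"_deberta-v3-base_for_{model_name}"
    )


# Every combination of the listed options (assumed, not verified, to exist on the hub).
ALL_MODEL_IDS = [
    apricot_model_id(*combination)
    for combination in product(MODEL_TYPES, DATASET_NAMES, MODEL_NAMES)
]

if __name__ == "__main__":
    for model_id in ALL_MODEL_IDS:
        print(model_id)
```

The resulting string can be passed to AutoModel.from_pretrained in place of the hard-coded name in the example above.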
Install the repository by cloning it and installing the dependencies via pip, using Python 3.10:
git clone https://github.com/parameterlab/apricot
cd apricot
pip3 install -r requirements.txt
Note that some scripts and functionalities require certain variables to be set, either in a secret.py file in the project directory or as environment variables.
These include OPENAI_API_KEY when requesting data from the OpenAI API,
WANDB_API_KEY and WANDB_USER_NAME for using Weights & Biases (required for hyperparameter search),
and COUNTRY_CODE for carbon emission tracking.
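Before launching a script, it can help to check that the relevant variables are present. The sketch below only inspects environment variables and does not read secret.py; the grouping keys ("openai", "wandb", "emissions") and the function name missing_variables are hypothetical and not part of the repository.

```python
import os

# Variable names as listed in this README; the grouping keys are hypothetical.
REQUIRED_VARIABLES = {
    "openai": ("OPENAI_API_KEY",),
    "wandb": ("WANDB_API_KEY", "WANDB_USER_NAME"),
    "emissions": ("COUNTRY_CODE",),
}


def missing_variables(feature, environ=None):
    """Return the variables for `feature` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES[feature] if not environ.get(name)]


if __name__ == "__main__":
    for feature in REQUIRED_VARIABLES:
        missing = missing_variables(feature)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{feature}: {status}")
```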
The scripts to replicate the experimental results are all located in /experiments.
Before running them in sequence, make sure to generate the necessary data for both datasets and models.
For TriviaQA, this can be done by running:
python3 run_regression_experiment.py --dataset-name trivia_qa --device cuda --num-training-steps 0 --num-in-context-samples 10 --num-steps-temperature-scaling 0
python3 get_openai_data.py --dataset-name trivia_qa
and similarly for CoQA:
python3 run_regression_experiment.py --dataset-name coqa --device cuda --num-training-steps 0 --num-in-context-samples 0 --num-steps-temperature-scaling 0
python3 get_openai_data.py --dataset-name coqa --num-in-context-samples 0
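The two pairs of commands differ only in the dataset name and the number of in-context samples. As a rough sketch of how they could be generated from a single place, the hypothetical helper below (not part of the repository) rebuilds the exact command lines shown above; it only constructs them and does not run anything.

```python
import shlex

# Per-dataset settings taken from the commands in this README.
IN_CONTEXT_SAMPLES = {"trivia_qa": 10, "coqa": 0}
# Extra arguments passed to get_openai_data.py, as shown in this README.
OPENAI_EXTRA_ARGS = {"trivia_qa": [], "coqa": ["--num-in-context-samples", "0"]}


def data_generation_commands(dataset_name, device="cuda"):
    """Return the two data generation commands for a dataset as argument lists."""
    regression = [
        "python3", "run_regression_experiment.py",
        "--dataset-name", dataset_name,
        "--device", device,
        "--num-training-steps", "0",
        "--num-in-context-samples", str(IN_CONTEXT_SAMPLES[dataset_name]),
        "--num-steps-temperature-scaling", "0",
    ]
    openai_data = [
        "python3", "get_openai_data.py",
        "--dataset-name", dataset_name,
        *OPENAI_EXTRA_ARGS[dataset_name],
    ]
    return [regression, openai_data]


if __name__ == "__main__":
    for dataset in IN_CONTEXT_SAMPLES:
        for command in data_generation_commands(dataset):
            # Print only; pass `command` to subprocess.run to execute it.
            print(shlex.join(command))
```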
Afterwards, run the following scripts from /experiments:
sh run_main_experiments.sh for the experimental results in Section 4
sh run_ablation_experiments.sh and sh run_ablation_experiments_coqa.sh for the ablation results in Appendix A.5
Please cite the paper as follows:
@inproceedings{ulmer2024calibrating,
title = "Calibrating Large Language Models Using Their Generations Only",
author = "Ulmer, Dennis and
Gubri, Martin and
Lee, Hwaran and
Yun, Sangdoo and
Oh, Seong",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.824",
doi = "10.18653/v1/2024.acl-long.824",
pages = "15440--15459",
abstract = "As large language models (LLMs) are increasingly deployed in user-facing applications, building trust and maintaining safety by accurately quantifying a model{'}s confidence in its prediction becomes even more important. However, finding effective ways to calibrate LLMs{---}especially when the only interface to the models is their generated text{---}remains a challenge. We propose APRICOT (Auxiliary prediction of confidence targets): A method to set confidence targets and train an additional model that predicts an LLM{'}s confidence based on its textual input and output alone. This approach has several advantages: It is conceptually simple, does not require access to the target model beyond its output, does not interfere with the language generation, and has a multitude of potential usages, for instance by verbalizing the predicted confidence or using it to re-prompting the LLM to accurately reflecting its uncertainty. We show how our approach performs competitively in terms of calibration error for white-box and black-box LLMs on closed-book question-answering to detect incorrect LLM answers.",
}