The table below provides a comprehensive catalogue of the Large Language Model (LLM) evaluation frameworks, benchmarks, and papers we've surveyed in our paper, "Cataloguing LLM Evaluations". It organizes them according to the taxonomy proposed in the paper.
The field of LLM evaluation is advancing rapidly, and collaboration with the broader community is pivotal to keeping this catalogue relevant and useful.
To that end, we invite submissions of LLM evaluation frameworks, benchmarks, and papers for inclusion in this catalogue.
Before you raise a PR for a new submission, please read our contribution guidelines. Submissions will be reviewed and integrated into the catalogue on a rolling basis.
For any inquiries, feel free to reach out to us at [email protected].
| Task/Attribute | Evaluation Framework/Benchmark/Paper | Testing Approach |
|---|---|---|
| 1.1. Natural Language Understanding | | |
| Text classification | HELM | Benchmarking |
| | Big-bench | Benchmarking |
| | Hugging Face | Benchmarking |
| Sentiment analysis | HELM | Benchmarking |
| | Evaluation Harness | Benchmarking |
| | Big-bench | Benchmarking |
| Toxicity detection | HELM | Benchmarking |
| | Evaluation Harness | Benchmarking |
| | Big-bench | Benchmarking |
| Information retrieval | HELM | Benchmarking |
| Sufficient information | Big-bench | Benchmarking |
| | FLASK | Benchmarking (with human and model scoring) |
| Natural language inference | Big-bench | Benchmarking |
| | Evaluation Harness | Benchmarking |
| General English understanding | HELM | Benchmarking |
| | Big-bench | Benchmarking |
| | Evaluation Harness | Benchmarking |
| | Eval Gauntlet | Benchmarking |
| 1.2. Natural Language Generation | | |
| Summarization | HELM | Benchmarking |
| | Big-bench | Benchmarking |
| | Evaluation Harness | Benchmarking |
| | Hugging Face | Benchmarking |
| Question generation and answering | HELM | Benchmarking |
| | Big-bench | Benchmarking |
| | Evaluation Harness | Benchmarking |
| | FLASK | Benchmarking (with human and model scoring) |
| | Hugging Face | Benchmarking |
| | Eval Gauntlet | Benchmarking |
| Conversations and dialogue | MT-bench | Benchmarking (with human and model scoring) |
| | Evaluation Harness | Benchmarking |
| | Hugging Face | Benchmarking |
| Paraphrasing | Big-bench | Benchmarking |
| Other response qualities | FLASK | Benchmarking (with human and model scoring) |
| | Big-bench | Benchmarking |
| | Putting GPT-3's Creativity to the (Alternative Uses) Test | Benchmarking (with human scoring) |
| Miscellaneous text generation | Hugging Face | Benchmarking |
| 1.3. Reasoning | HELM | Benchmarking |
| | Big-bench | Benchmarking |
| | Evaluation Harness | Benchmarking |
| | Eval Gauntlet | Benchmarking |
| 1.4. Knowledge and factuality | HELM | Benchmarking |
| | Big-bench | Benchmarking |
| | Evaluation Harness | Benchmarking |
| | FLASK | Benchmarking (with human and model scoring) |
| | Eval Gauntlet | Benchmarking |
| 1.5. Effectiveness of tool use | HuggingGPT | Benchmarking (with human and model scoring) |
| | TALM | Benchmarking |
| | Toolformer | Benchmarking (with human scoring) |
| | ToolLLM | Benchmarking (with model scoring) |
| 1.6. Multilingualism | Big-bench | Benchmarking |
| | Evaluation Harness | Benchmarking |
| | BELEBELE | Benchmarking |
| | MASSIVE | Benchmarking |
| | HELM | Benchmarking |
| | Eval Gauntlet | Benchmarking |
| 1.7. Context length | Big-bench | Benchmarking |
| | Evaluation Harness | Benchmarking |
| 2.1. Law | LegalBench | Benchmarking (with algorithmic and human scoring) |
| 2.2. Medicine | Large Language Models Encode Clinical Knowledge | Benchmarking (with human scoring) |
| | Towards Generalist Biomedical AI | Benchmarking (with human scoring) |
| 2.3. Finance | BloombergGPT | Benchmarking |
| 3.1. Toxicity generation | HELM | Benchmarking |
| | DecodingTrust | Benchmarking |
| | Red Teaming Language Models to Reduce Harms | Manual Red Teaming |
| | Red Teaming Language Models with Language Models | Automated Red Teaming |
| 3.2. Bias | | |
| Demographical representation | HELM | Benchmarking |
| | Finding New Biases in Language Models with a Holistic Descriptor Dataset | Benchmarking |
| Stereotype bias | HELM | Benchmarking |
| | DecodingTrust | Benchmarking |
| | Big-bench | Benchmarking |
| | Evaluation Harness | Benchmarking |
| | Red Teaming Language Models to Reduce Harms | Manual Red Teaming |
| Fairness | DecodingTrust | Benchmarking |
| Distributional bias | Red Teaming Language Models with Language Models | Automated Red Teaming |
| Representation of subjective opinions | Towards Measuring the Representation of Subjective Global Opinions in Language Models | Benchmarking |
| Political bias | From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models | Benchmarking |
| | The Self-Perception and Political Biases of ChatGPT | Benchmarking |
| Capability fairness | HELM | Benchmarking |
| 3.3. Machine ethics | DecodingTrust | Benchmarking |
| | Evaluation Harness | Benchmarking |
| 3.4. Psychological traits | Does GPT-3 Demonstrate Psychopathy? | Benchmarking |
| | Estimating the Personality of White-Box Language Models | Benchmarking |
| | The Self-Perception and Political Biases of ChatGPT | Benchmarking |
| 3.5. Robustness | HELM | Benchmarking |
| | DecodingTrust | Benchmarking |
| | Big-bench | Benchmarking |
| | Susceptibility to Influence of Large Language Models | Benchmarking |
| 3.6. Data governance | DecodingTrust | Benchmarking |
| | HELM | Benchmarking |
| | Red Teaming Language Models to Reduce Harms | Manual Red Teaming |
| | Red Teaming Language Models with Language Models | Automated Red Teaming |
| | An Evaluation on Large Language Model Outputs: Discourse and Memorization | Benchmarking (with human scoring) |
| 4.1. Dangerous Capabilities | | |
| Offensive cyber capabilities | GPT-4 System Card | System Card |
| Weapons acquisition | GPT-4 System Card | System Card |
| Self and situation awareness | Big-bench | Benchmarking |
| Autonomous replication / self-proliferation | ARC Evals | Manual Red Teaming |
| Persuasion and manipulation | HELM | Benchmarking (with human scoring) |
| | Big-bench | Benchmarking |
| | Co-writing with Opinionated Language Models Affects Users' Views | Manual Red Teaming |
| 5.1. Misinformation | HELM | Benchmarking |
| | Big-bench | Benchmarking |
| | Red Teaming Language Models to Reduce Harms | Manual Red Teaming |
| 5.2. Disinformation | HELM | Benchmarking (with human scoring) |
| | Big-bench | Benchmarking |
| 5.3. Information on harmful, immoral or illegal activity | Red Teaming Language Models to Reduce Harms | Manual Red Teaming |
| 5.4. Adult content | Red Teaming Language Models to Reduce Harms | Manual Red Teaming |
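
Most entries above use "Benchmarking" as the testing approach, i.e. running a model against a fixed task suite and scoring its outputs automatically. As a purely illustrative sketch of what such a run looks like in practice, the snippet below uses EleutherAI's lm-evaluation-harness (the "Evaluation Harness" listed throughout the table). It assumes harness v0.4+ (`pip install lm-eval`); the model checkpoint and task are arbitrary placeholders, not recommendations from this catalogue.

```python
# Illustrative only: a minimal "Benchmarking" run with EleutherAI's
# lm-evaluation-harness. Assumes lm-eval v0.4+ is installed; the
# checkpoint and task below are arbitrary examples.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face model backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF causal LM
    tasks=["hellaswag"],                             # one task from the harness's registry
    num_fewshot=0,
    limit=100,  # subsample for a quick smoke test; drop for a full run
)

# Per-task metric dictionaries (e.g. accuracy), keyed by task name.
print(results["results"])
```

The red-teaming approaches in the table, by contrast, rely on humans or other models probing for failures rather than scoring outputs against a fixed dataset.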