A Dataset for Evaluating Retrieval-Augmented Generation Across Documents
MultiHop-RAG: a QA dataset to evaluate retrieval and reasoning across documents with metadata in the RAG pipelines. It contains 2556 queries, with evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications.
📄 Paper Link (Accepted by COLM 2024): MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
🤗 Hugging Face dataloader
1. For Retrieval
Please try 'simple_retrieval.py,' a sample use case demonstrating retrieval using this dataset.
pip install llama-index==0.9.40
# test simple retrieval and save results
python simple_retrieval.py --retriever BAAI/llm-embedder
# test simple retrieval with rerank and save results
python simple_retrieval.py --retriever BAAI/llm-embedder --rerank
2. For QA
Please try 'qa_llama.py,' a sample use case demonstrating query and answer with llama using this dataset.
python qa_llama.py
1. For Retrieval: 'retrieval_evaluate.py'
2. For QA: 'qa_evaluate.py'
python retrieval_evaluate.py --file {saved_file_path}
For research purposes, we open-sourced part of the code to construct the dataset. However, the current structure of the code is not very tidy. We will organize it in the future.
💡 Just For Reference: pipeline/
@misc{tang2024multihoprag,
title={MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries},
author={Yixuan Tang and Yi Yang},
year={2024},
eprint={2401.15391},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
MultiHop-RAG is licensed under ODC-BY