Traditional Retrieval-Augmented Generation (RAG) methods rely on a separate retriever to fetch documents, which inflates the input with redundant retrieved tokens and prevents joint optimization of the retrieval and generation components. RetroLLM introduces a unified framework that integrates retrieval and generation into a single auto-regressive decoding process, enabling LLMs to generate fine-grained evidence directly from the corpus with FM-Index constrained decoding.
To mitigate false pruning in constrained evidence generation, we propose hierarchical FM-Index constraints, which first identify a relevant subset of documents, together with a forward-looking constrained decoding strategy that makes the model aware of the relevance of future sequences. This approach improves evidence accuracy while significantly reducing input token usage, since only the question needs to be fed to the LLM to perform the entire RAG process.
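To make the constrained-decoding idea concrete, here is a minimal, self-contained Python sketch: a toy substring index stands in for the FM-Index and answers "which tokens can follow this prefix in the corpus?", and greedy decoding is masked to those continuations at every step. All names here (`ToyCorpusIndex`, `constrained_greedy_decode`) are illustrative and are not part of the RetroLLM codebase.

```python
# Minimal sketch of corpus-constrained decoding (illustrative; not the
# RetroLLM API). A real system queries an FM-Index over the tokenized
# corpus; a toy substring index plays that role here.
from typing import Callable, Dict, List, Set, Tuple

class ToyCorpusIndex:
    """Answers the FM-Index-style query: which tokens can follow a prefix?"""

    def __init__(self, corpus: List[List[int]]):
        # Map every substring of every document to its possible next tokens.
        self._next: Dict[Tuple[int, ...], Set[int]] = {}
        for doc in corpus:
            for start in range(len(doc)):
                for end in range(start, len(doc)):
                    prefix = tuple(doc[start:end])
                    self._next.setdefault(prefix, set()).add(doc[end])

    def allowed(self, prefix: Tuple[int, ...]) -> Set[int]:
        return self._next.get(prefix, set())

def constrained_greedy_decode(
    score: Callable[[List[int], int], float],
    index: ToyCorpusIndex,
    max_len: int = 8,
) -> List[int]:
    """Greedy decoding with each step masked to corpus-attested tokens,
    so the output is always a contiguous span of some corpus document."""
    seq: List[int] = []
    for _ in range(max_len):
        allowed = index.allowed(tuple(seq))
        if not allowed:
            break  # no corpus document continues this prefix
        seq.append(max(allowed, key=lambda tok: score(seq, tok)))
    return seq

# Toy usage: two "documents" over an integer token vocabulary.
index = ToyCorpusIndex([[5, 9, 2], [5, 7]])
print(constrained_greedy_decode(lambda seq, tok: float(tok), index))  # [9, 2]
```

An actual FM-Index exposes the same "allowed continuations" interface in memory proportional to the compressed corpus; the hierarchical and forward-looking constraints described above then refine which continuations are permitted at each step.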
```bash
# Create conda environment
conda create -n retrollm python=3.9
conda activate retrollm

# Install requirements
pip install -r requirements.txt
```
```bash
# Build SWIG 4.0.2 from source
wget http://prdownloads.sourceforge.net/swig/swig-4.0.2.tar.gz
tar zxvf swig-4.0.2.tar.gz
cd swig-4.0.2
./configure --without-pcre --prefix=YOUR_CODE_DIR
make -j
make install

# Build the sdsl-lite dependency and the FM-Index extension
cd RetroLLM
env CFLAGS='-fPIC' CXXFLAGS='-fPIC' scripts/res/external/sdsl-lite/install.sh
swig -c++ -python scripts/seal/cpp_modules/fm_index.i && python setup.py build_ext --inplace
```
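If the build succeeded, the generated extension should be importable. Here is a quick sanity check; note that the module name `fm_index` is an assumption inferred from the SWIG interface file `fm_index.i`, so adjust it if the `%module` declaration there differs.

```python
# Sanity check that the SWIG-built FM-Index extension is importable.
# The module name "fm_index" is inferred from fm_index.i (an assumption).
import importlib

mod = importlib.import_module("fm_index")
print("FM-Index extension loaded from:", mod.__file__)
```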
RetroLLM follows the FlashRAG data format for both training and evaluation. The datasets include:
Training Datasets:
- Natural Questions (NQ)
- TriviaQA
- HotpotQA
Evaluation Datasets:
- Natural Questions (NQ)
- TriviaQA
- HotpotQA
- PopQA
- 2WikiMultiHopQA
Each dataset should be processed following the FlashRAG format specifications. Detailed training scripts coming soon.
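For reference, FlashRAG datasets are stored as JSONL with one QA record per line; a minimal loader is sketched below. The field names (`id`, `question`, `golden_answers`) follow the FlashRAG format specification, but verify them against your local copy of the data.

```python
# Illustrative loader for FlashRAG-style JSONL records. Field names follow
# the FlashRAG format spec; confirm them against your local data files.
import json
from typing import Iterator, List, Tuple

def load_flashrag_jsonl(path: str) -> Iterator[Tuple[str, str, List[str]]]:
    """Yield (id, question, golden_answers) for each line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["id"], record["question"], record["golden_answers"]
```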
To evaluate the model on the test sets:
1. Edit `scripts/generate.py` to set the correct paths for (a placeholder example is sketched after this list):
   - Model and checkpoint paths
   - Dataset paths
   - Output directory
2. Run the evaluation:

   ```bash
   python scripts/generate.py
   ```
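As an illustration of step 1, the paths to edit typically look like the following. These variable names are placeholders for this sketch, not necessarily the identifiers actually used in `scripts/generate.py`.

```python
# Placeholder path configuration for scripts/generate.py; the actual
# variable names in the script may differ.
MODEL_PATH = "/path/to/base_llm"                       # base model weights
CHECKPOINT_PATH = "/path/to/retrollm_checkpoint"       # trained RetroLLM checkpoint
DATASET_PATH = "/path/to/flashrag_data/nq/test.jsonl"  # FlashRAG-format test set
OUTPUT_DIR = "/path/to/eval_outputs"                   # where predictions are saved
```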
If you find this work helpful, please cite our paper:
```bibtex
@article{retrollm2024,
  title      = {RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation},
  author     = {Xiaoxi Li and Jiajie Jin and Yujia Zhou and Yongkang Wu and Zhonghua Li and Qi Ye and Zhicheng Dou},
  journal    = {CoRR},
  volume     = {abs/2412.11919},
  year       = {2024},
  url        = {https://arxiv.org/abs/2412.11919},
  eprinttype = {arXiv},
  eprint     = {2412.11919}
}
```
This project is released under the MIT License.