Code repo for EMNLP 2024 paper - EfficientRAG: Efficient Retriever for Multi-Hop Question Answering
EfficientRAG is a framework that trains a Labeler and a Filter to perform multi-hop RAG without multiple LLM calls.
- 2024-09-12: open-sourced the code
- 2025-03-04: released our data
You can now download our synthesized data from this link. Unzip the EfficientRAG.zip file and place all the data under the data directory. Within this directory, the negative_sampling_extracted folder contains our final synthesized data, referenced in 2.4 Negative Sampling. Additionally, the efficient_rag directory includes two folders, labeler and filter, which store the training data constructed for the model, as referenced in 2.5 Training Data.
You need to install PyTorch >= 2.1.0 first, and then install the dependent Python libraries by running
pip install -r requirements.txt
Alternatively, you can create a conda environment with Python >= 3.9:
conda create -n <ENV_NAME> python=3.9 pip
conda activate <ENV_NAME>
pip install -r requirements.txt
- Download the datasets HotpotQA, 2WikiMQA, and MuSiQue. Split each into train, dev, and test sets, then put them under data/dataset.
- Download the retriever model Contriever and the base model DeBERTa, and put them under model_cache.
- Prepare the corpus by extracting documents and constructing embeddings (see the sketch after the commands below).
python src/retrievers/multihop_data_extractor.py --dataset hotpotQA
python src/retrievers/passage_embedder.py \
--passages data/corpus/hotpotQA/corpus.jsonl \
--output_dir data/corpus/hotpotQA/contriever \
--model_type contriever
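For reference, Contriever produces a passage embedding by mean pooling the last hidden states over non-padding tokens. Below is a minimal sketch of that operation, assuming the public facebook/contriever checkpoint and HuggingFace transformers (the repo loads its own copy from model_cache instead):

import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint name; substitute your local model_cache path.
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever")

def mean_pooling(token_embeddings, mask):
    # Zero out padding positions, then average over the remaining tokens.
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.0)
    return token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]

passages = ["Contriever is a dense retriever trained with contrastive learning."]
inputs = tokenizer(passages, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embeddings = mean_pooling(outputs.last_hidden_state, inputs["attention_mask"])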
- Deploy LLaMA-3-70B-Instruct with the vLLM framework, and configure it in src/language_models/llama.py (an example launch command follows).
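As a sketch of the deployment step, vLLM's OpenAI-compatible server can be started as below; the model path and tensor-parallel degree are assumptions, so adjust them to your checkpoint location and GPU count (newer vLLM versions also offer the equivalent vllm serve entrypoint):

python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--tensor-parallel-size 4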
We use the hotpotQA training set as an example; you can construct the 2WikiMQA and MuSiQue data in the same way. Query decomposition splits a multi-hop question into single-hop sub-queries, e.g., "Who directed the film that won Best Picture in 1998?" decomposes into "Which film won Best Picture in 1998?" and "Who directed <that film>?".
python src/data_synthesize/query_decompose.py \
--dataset hotpotQA \
--split train \
--model llama3
python src/data_synthesize/token_labeling.py \
--dataset hotpotQA \
--split train \
--model llama3
python src/data_synthesize/token_extraction.py \
--data_path data/synthesized_token_labeling/hotpotQA/train.jsonl \
--save_path data/token_extracted/hotpotQA/train.jsonl \
--verbose
python src/data_synthesize/next_hop_query_construction.py \
--dataset hotpotQA \
--split train \
--model llama
python src/data_synthesize/next_hop_query_filtering.py \
--data_path data/synthesized_next_query/hotpotQA/train.jsonl \
--save_path data/next_query_extracted/hotpotQA/train.jsonl \
--verbose
python src/data_synthesize/negative_sampling.py \
--dataset hotpotQA \
--split train \
--retriever contriever
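Conceptually, negative sampling mines hard negatives: passages the retriever ranks highly even though they are not gold evidence. A minimal sketch under that assumption (function and variable names are illustrative, not the repo's API):

import numpy as np

def hard_negatives(query_emb, passage_embs, gold_ids, k=5):
    # Score every passage by inner product with the query embedding.
    scores = passage_embs @ query_emb
    ranked = np.argsort(-scores)
    # The highest-scoring non-gold passages serve as hard negatives.
    return [int(i) for i in ranked if int(i) not in gold_ids][:k]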
python src/data_synthesize/negative_sampling_labeled.py \
--dataset hotpotQA \
--split train \
--model llama
python src/data_synthesize/negative_token_extraction.py \
--dataset hotpotQA \
--split train \
--verbose
python src/data_synthesize/training_data_synthesize.py \
--dataset hotpotQA \
--split train
Train the Filter model
python src/efficient_rag/filter_training.py \
--dataset hotpotQA \
--save_path saved_models/filter
Train the Labeler model
python src/efficient_rag/labeler_training.py \
--dataset hotpotQA \
--tags 2
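The labeler is built on DeBERTa; with --tags 2 it presumably amounts to binary token classification (e.g., useful vs. not useful tokens). A minimal sketch of such a model head, assuming the public microsoft/deberta-v3-large checkpoint (the repo loads from model_cache):

from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-v3-large",
    num_labels=2,  # one logit per tag, matching --tags 2
)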
Run the EfficientRAG retrieval procedure
python src/efficientrag_retrieve.py \
--dataset hotpotQA \
--retriever contriever \
--labels 2 \
--labeler_ckpt <<PATH_TO_LABELER_CKPT>> \
--filter_ckpt <<PATH_TO_FILTER_CKPT>> \
--topk 10
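At a high level, the retrieval procedure iterates: retrieve top-k chunks for the current query, let the labeler tag each chunk (and decide whether to continue), and let the filter compose the next-hop query from the tagged tokens. A pseudocode sketch of that loop (all helper names here are hypothetical, not the repo's API):

def efficient_rag_retrieve(question, retriever, labeler, filter_model,
                           topk=10, max_hops=4):
    query, collected = question, []
    for _ in range(max_hops):
        chunks = retriever.search(query, topk)            # hypothetical helper
        tagged = [labeler.label(question, c) for c in chunks]
        useful = [t for t in tagged if t.tag != "terminate"]
        collected.extend(useful)
        if not useful:                                    # nothing left to follow
            break
        query = filter_model.next_query(question, useful)  # hypothetical helper
    return collected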
Use LLaMA-3-8B-Instruct as the generator
python src/efficientrag_qa.py \
--fpath <<MODEL_INFERENCE_RESULT>> \
--model llama-8B \
--dataset hotpotQA
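If LLaMA-3-8B-Instruct is served through the same vLLM OpenAI-compatible endpoint as in the deployment step, a minimal sanity check of the generator call looks like this (base URL and model name are assumptions):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Answer concisely: who wrote Hamlet?"}],
)
print(response.choices[0].message.content)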
If you find this paper or the code useful, please cite:
@inproceedings{zhuang2024efficientrag,
title={EfficientRAG: Efficient Retriever for Multi-Hop Question Answering},
author={Zhuang, Ziyuan and Zhang, Zhiyang and Cheng, Sitao and Yang, Fangkai and Liu, Jia and Huang, Shujian and Lin, Qingwei and Rajmohan, Saravan and Zhang, Dongmei and Zhang, Qi},
booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
pages={3392--3411},
year={2024}
}