MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

27 Jan 2024 | Yixuan Tang and Yi Yang
MultiHop-RAG is a novel benchmarking dataset designed to evaluate retrieval-augmented generation (RAG) systems on multi-hop queries, which require retrieving and reasoning over multiple pieces of evidence. The dataset comprises a knowledge base, a large collection of multi-hop queries, their ground-truth answers, and the associated supporting evidence. It was constructed from a collection of news articles, with GPT-4 used to generate the multi-hop queries and their corresponding answers. The dataset is publicly available at https://github.com/yixuantt/MultiHop-RAG/.

The paper introduces four types of multi-hop queries: Inference, Comparison, Temporal, and Null. Inference queries require reasoning over multiple pieces of evidence, Comparison queries involve comparing evidence, Temporal queries require analyzing temporal information across evidence, and Null queries have no answer in the knowledge base.

The benchmark is evaluated with both retrieval and generation metrics: Mean Average Precision at K (MAP@K), Mean Reciprocal Rank at K (MRR@K), and Hit Rate at K (Hit@K) for retrieval, and response accuracy for generation. Experiments on MultiHop-RAG show that existing RAG systems perform unsatisfactorily, indicating that current implementations are inadequate for effectively retrieving and answering multi-hop queries.

The dataset is intended as a resource for the community in developing and benchmarking RAG systems, thereby facilitating greater adoption of LLMs in practice. The paper also discusses other potential use cases for MultiHop-RAG, including query decomposition and the development of LLM-based agents for multi-hop queries.
The authors conclude that there is still room for improvement in the reasoning capabilities of LLMs for multi-hop queries.
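The retrieval metrics named above can be sketched per query as follows. This is a minimal illustration, not the benchmark's official evaluation code: the function names and the assumption that retrieved results and gold evidence are represented as document IDs are mine.

```python
def hit_at_k(retrieved, gold, k):
    """1.0 if any of the top-k retrieved IDs is a gold evidence ID, else 0.0."""
    return 1.0 if any(doc in gold for doc in retrieved[:k]) else 0.0

def mrr_at_k(retrieved, gold, k):
    """Reciprocal rank of the first gold ID within the top k (0.0 if none)."""
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0

def map_at_k(retrieved, gold, k):
    """Average of precision@i over each rank i <= k where a gold ID appears,
    normalized by the maximum number of gold hits possible in the top k."""
    hits, precisions = 0, []
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in gold:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / min(len(gold), k) if precisions else 0.0

# Example: one query, ranked list of retrieved chunk IDs vs. gold evidence.
retrieved = ["d3", "d1", "d7"]
gold = {"d1", "d5"}
print(hit_at_k(retrieved, gold, 3))  # 1.0 (d1 appears in the top 3)
print(mrr_at_k(retrieved, gold, 3))  # 0.5 (first gold hit at rank 2)
```

The benchmark-level scores would then be these per-query values averaged over all queries (excluding Null queries, which have no gold evidence to retrieve).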