**Retrieval-Augmented Generation (RAG)** has shown promise in enhancing large language models (LLMs) by retrieving relevant knowledge, improving response quality, and mitigating hallucinations. However, existing RAG systems struggle with multi-hop queries, which require reasoning over multiple pieces of evidence. To address this gap, the authors introduce MultiHop-RAG, a novel dataset designed for multi-hop queries. The dataset includes a knowledge base, a collection of multi-hop queries, their ground-truth answers, and supporting evidence. The authors detail the construction process, which involves extracting factual sentences from news articles, generating claims, identifying bridge-entities and topics, and creating multi-hop queries. Two experiments are conducted to evaluate the effectiveness of different embedding models and LLMs in handling multi-hop queries. The results show that current RAG methods perform poorly in retrieving and answering multi-hop queries. The authors hope that MultiHop-RAG will serve as a valuable resource for developing and benchmarking effective RAG systems, thereby advancing the adoption of LLMs in practical applications. The dataset and implemented RAG system are publicly available.
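
To make the dataset's shape and the retrieval side of the evaluation more concrete, the following is a minimal sketch and not the authors' implementation. The record layout (`MultiHopExample` and its field names), the chunk ids, and the metric (`evidence_recall_at_k`) are assumptions introduced here for illustration; the released dataset's actual schema and the metrics reported by the authors may differ.

```python
from dataclasses import dataclass


@dataclass
class MultiHopExample:
    """Hypothetical record: a query, its answer, and the evidence it relies on."""
    query: str
    answer: str
    evidence_ids: list[str]  # ids of knowledge-base chunks supporting the answer


def evidence_recall_at_k(retrieved_ids: list[str], evidence_ids: list[str], k: int) -> float:
    """Fraction of the gold evidence chunks found among the top-k retrieved ids."""
    if not evidence_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    found = sum(1 for eid in evidence_ids if eid in top_k)
    return found / len(evidence_ids)


def all_evidence_at_k(retrieved_ids: list[str], evidence_ids: list[str], k: int) -> bool:
    """True only if every gold evidence chunk is in the top-k results."""
    return evidence_recall_at_k(retrieved_ids, evidence_ids, k) == 1.0


# Toy, made-up example: the query needs two chunks, "c2" and "c4".
example = MultiHopExample(
    query="(placeholder multi-hop question)",
    answer="(placeholder answer)",
    evidence_ids=["c2", "c4"],
)
ranked = ["c7", "c2", "c9", "c4"]  # hypothetical retriever output, best first

recall_at_2 = evidence_recall_at_k(ranked, example.evidence_ids, k=2)  # 0.5
recall_at_4 = evidence_recall_at_k(ranked, example.evidence_ids, k=4)  # 1.0
```

The stricter `all_evidence_at_k` check reflects why multi-hop queries are harder than single-hop ones: retrieving one relevant chunk is not enough when the answer depends on combining several pieces of evidence.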