CRAG - Comprehensive RAG Benchmark

7 Jun 2024 | Xiao Yang*1, Kai Sun*1, Hao Xin*3, Yushi Sun*3, Nikita Bhalla1, Xiangsen Chen4, Sajal Choudhary1, Rongze Daniel Gui1, Ziran Will Jiang1, Ziyu Jiang4, Lingkun Kong1, Brian Moran1, Jiaqi Wang1, Yifan Ethan Xu1, An Yan1, Chenyu Yang4, Eting Yuan1, Hanwen Zha1, Nan Tang4, Lei Chen3,4, Nicolas Scheffer1, Yue Liu1, Nirav Shah1, Rakesh Wang1, Anuj Kumar1, Wen-tau Yih2, and Xin Luna Dong1
The paper introduces the Comprehensive RAG Benchmark (CRAG), a new benchmark designed to advance research in Retrieval-Augmented Generation (RAG) systems. CRAG consists of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) searches, covering five domains (Finance, Sports, Music, Movie, and Open Domain) and eight question categories. The benchmark aims to address the limitations of existing RAG datasets by reflecting diverse and dynamic real-world Question Answering (QA) tasks. CRAG is evaluated through three tasks: Retrieval Summarization, KG and Web Retrieval Augmentation, and End-to-End RAG. The evaluation mechanism distinguishes between hallucinated answers and missing answers, penalizing the former more heavily. The benchmark has been used to evaluate both straightforward RAG solutions and state-of-the-art industry solutions, revealing significant gaps in accuracy, especially for questions with higher dynamism, lower popularity, or higher complexity. The CRAG benchmark has also been used to organize a KDD Cup 2024 challenge, attracting thousands of participants and submissions within the first 50 days. The authors commit to maintaining CRAG to support ongoing research in RAG and general QA solutions.
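The asymmetry between hallucinated and missing answers can be made concrete with a small scoring sketch. The snippet below is a minimal illustration, assuming the common convention of +1 for an accurate answer, 0 for a missing answer (e.g. "I don't know"), and -1 for a hallucinated one; the label names and helper function are hypothetical and not taken from the benchmark's released evaluation code.

```python
# Minimal sketch of a CRAG-style truthfulness score.
# Assumed weights (illustrative, not the official evaluation code):
#   accurate -> +1, missing -> 0, hallucinated -> -1
from typing import Iterable

SCORE = {"accurate": 1.0, "missing": 0.0, "hallucinated": -1.0}

def truthfulness_score(labels: Iterable[str]) -> float:
    """Average per-question score: hallucinations are penalized,
    while abstaining (a missing answer) is merely not rewarded."""
    labels = list(labels)
    if not labels:
        raise ValueError("no labels to score")
    return sum(SCORE[label] for label in labels) / len(labels)

# Example: a system that answers 6 questions correctly, abstains on 2,
# and hallucinates on 2 out of 10.
labels = ["accurate"] * 6 + ["missing"] * 2 + ["hallucinated"] * 2
print(truthfulness_score(labels))  # 0.4
```

Under such a scheme, a system that abstains when unsure scores higher than one that guesses and hallucinates, which is the behavior the benchmark's evaluation is designed to reward.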