CRAG - Comprehensive RAG Benchmark

7 Jun 2024 | Xiao Yang*1, Kai Sun*1, Hao Xin*3, Yushi Sun*3, Nikita Bhalla1, Xiangsen Chen4, Sajal Choudhary1, Rongze Daniel Gui1, Ziran Will Jiang1, Ziyu Jiang4, Lingkun Kong1, Brian Moran1, Jiaqi Wang1, Yifan Ethan Xu1, An Yan1, Chenyu Yang4, Eting Yuan1, Hanwen Zha1, Nan Tang4, Lei Chen3,4, Nicolas Scheffer1, Yue Liu1, Nirav Shah1, Rakesh Wang1, Anuj Kumar1, Wen-tau Yih2, and Xin Luna Dong1
The paper introduces the Comprehensive RAG Benchmark (CRAG), a new benchmark designed to advance research in Retrieval-Augmented Generation (RAG) systems. CRAG consists of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) searches, covering five domains (Finance, Sports, Music, Movie, and Open Domain) and eight question categories. The benchmark aims to address the limitations of existing RAG datasets by reflecting diverse and dynamic real-world Question Answering (QA) tasks. CRAG is evaluated through three tasks: Retrieval Summarization, KG and Web Retrieval Augmentation, and End-to-End RAG. The evaluation mechanism distinguishes between hallucinated answers and missing answers, penalizing the former more heavily. The benchmark has been used to evaluate both straightforward RAG solutions and state-of-the-art industry solutions, revealing significant gaps in accuracy, especially for questions with higher dynamism, lower popularity, or higher complexity. The CRAG benchmark has also been used to organize a KDD Cup 2024 challenge, attracting thousands of participants and submissions within the first 50 days. The authors commit to maintaining CRAG to support ongoing research in RAG and general QA solutions.
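The asymmetry between hallucinated and missing answers can be made concrete with a small scoring sketch. The snippet below is a minimal illustration, assuming the common convention of +1 for an accurate answer, 0 for a missing answer (e.g. "I don't know"), and -1 for a hallucinated one; the label names and helper function are hypothetical and not taken from the benchmark's released evaluation code.

```python
# Minimal sketch of a CRAG-style truthfulness score.
# Assumed weights (illustrative, not the official evaluation code):
#   accurate -> +1, missing -> 0, hallucinated -> -1
from typing import Iterable

SCORE = {"accurate": 1.0, "missing": 0.0, "hallucinated": -1.0}

def truthfulness_score(labels: Iterable[str]) -> float:
    """Average per-question score: hallucinations are penalized,
    while abstaining (a missing answer) is merely not rewarded."""
    labels = list(labels)
    if not labels:
        raise ValueError("no labels to score")
    return sum(SCORE[label] for label in labels) / len(labels)

# Example: a system that answers 6 questions correctly, abstains on 2,
# and hallucinates on 2 out of 10.
labels = ["accurate"] * 6 + ["missing"] * 2 + ["hallucinated"] * 2
print(truthfulness_score(labels))  # 0.4
```

Under such a scheme, a system that abstains when unsure scores higher than one that guesses and hallucinates, which is the behavior the benchmark's evaluation is designed to reward.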