25 Jun 2024 | Robert Friel, Masha Belyi, Atindriyo Sanyal
**RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems**
Retrieval-Augmented Generation (RAG) systems have become essential for incorporating domain-specific knowledge into user-facing chat applications powered by Large Language Models (LLMs). However, evaluating these systems remains challenging due to the lack of unified criteria and annotated datasets. To address this, the authors introduce RAGBench, a comprehensive, large-scale benchmark dataset of 100k examples covering five industry-specific domains and various RAG task types. The dataset is sourced from industry corpora such as user manuals, making it particularly relevant for industrial applications.
The authors also formalize the TRACe evaluation framework, which includes a set of explainable and actionable RAG evaluation metrics applicable across all RAG domains. These metrics include *context relevance*, *answer faithfulness*, *context utilization*, and *answer completeness*. The labeled dataset is released on Hugging Face, facilitating holistic evaluation of RAG systems and enabling continuous improvement in production applications.
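Since the dataset is distributed via Hugging Face, a natural first step is to load a subset and inspect the TRACe annotations. The snippet below is a minimal sketch: the dataset id (`rungalileo/ragbench`), the `hotpotqa` subset, and the field names are assumptions based on the description above, so check the dataset card for the exact schema.

```python
# Minimal sketch of pulling RAGBench annotations from Hugging Face.
# NOTE: the dataset id, subset name, and field names below are assumptions,
# not verified schema details -- consult the dataset card before relying on them.
from datasets import load_dataset

ds = load_dataset("rungalileo/ragbench", "hotpotqa", split="test")

example = ds[0]
print(example["question"])    # user query          (assumed field name)
print(example["documents"])   # retrieved context   (assumed field name)
print(example["response"])    # generated answer    (assumed field name)

# TRACe annotations: context relevance, context utilization,
# adherence (faithfulness), and completeness (assumed field names).
for key in ("relevance_score", "utilization_score",
            "adherence_score", "completeness_score"):
    print(key, example.get(key))
```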
Experiments show that while few-shot LLM judges perform well across domains and task types, they still underperform compared to a fine-tuned DeBERTa model on the RAG evaluation task. The authors identify areas where existing approaches fall short and propose the adoption of RAGBench with TRACe to advance the state of RAG evaluation systems.
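To make the "few-shot LLM judge" baseline concrete, here is a hedged sketch of how such a judge could be prompted for one TRACe metric (answer faithfulness). The prompt wording, the 0-to-1 scale, and the example shot are illustrative choices, not the authors' exact prompt; the judge's output would then be scored against RAGBench's annotated labels.

```python
# Hedged sketch of a few-shot LLM-judge prompt for answer faithfulness.
# The wording and scoring scale are illustrative, not the authors' prompt.

FEW_SHOT_EXAMPLE = (
    "Context: The device supports Bluetooth 5.0 and Wi-Fi 6.\n"
    "Answer: The device supports Bluetooth 5.0.\n"
    "Faithfulness: 1.0\n"
)

def build_faithfulness_prompt(context: str, answer: str) -> str:
    """Assemble a few-shot judge prompt asking whether the answer is
    fully supported by the retrieved context (score in [0, 1])."""
    return (
        "You are evaluating a RAG system. Rate how faithful the answer is "
        "to the context on a scale from 0 (unsupported) to 1 (fully supported).\n\n"
        f"{FEW_SHOT_EXAMPLE}\n"
        f"Context: {context}\n"
        f"Answer: {answer}\n"
        "Faithfulness:"
    )

# The prompt would be sent to an LLM of choice, and the returned score
# compared against RAGBench's annotated faithfulness label.
prompt = build_faithfulness_prompt(
    context="RAGBench contains 100k examples across five domains.",
    answer="RAGBench has 100k examples.",
)
print(prompt)
```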