25 Jun 2024 | Robert Friel, Masha Belyi, Atindriyo Sanyal
RAGBench is a comprehensive, large-scale benchmark dataset for evaluating Retrieval-Augmented Generation (RAG) systems, containing 100,000 examples across five industry-specific domains and various RAG task types. It is sourced from industry corpora such as user manuals, making it particularly relevant for real-world applications. The dataset includes detailed annotations for RAG evaluation, enabling actionable feedback for continuous improvement of production applications. RAGBench introduces the TRACe evaluation framework, a set of explainable and actionable metrics for assessing RAG systems. The dataset is available at https://huggingface.co/datasets/rungalileo/ragbench.
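Since the dataset is hosted on the Hugging Face Hub, it can be loaded with the standard datasets library. The sketch below is minimal and assumes a component subset named "hotpotqa" and a "train" split purely for illustration; the dataset card lists the actual component datasets and splits.

```python
from datasets import load_dataset

# Load one RAGBench component dataset; "hotpotqa" is an assumed subset name,
# see https://huggingface.co/datasets/rungalileo/ragbench for the full list.
subset = load_dataset("rungalileo/ragbench", "hotpotqa")

print(subset)                        # available splits and their sizes
print(subset["train"].column_names)  # per-example annotation fields (assuming a "train" split)
```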
RAG systems consist of a document retriever and an LLM that generates responses based on the retrieved context. However, comprehensive evaluation of RAG systems remains challenging due to the lack of unified evaluation criteria and annotated datasets. RAGBench addresses this gap by providing a large-scale benchmark with detailed annotations for context relevance, answer faithfulness, context utilization, and answer completeness. Its examples are drawn from real-world domains including biomedical research, legal contracts, customer support, and finance.
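To make these annotation dimensions concrete, the sketch below shows the rough shape of a single annotated RAG example. The field names are illustrative assumptions, not the dataset's exact schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RAGExample:
    """Illustrative shape of one annotated RAG example (field names assumed)."""
    question: str              # user query sent to the RAG system
    documents: List[str]       # retrieved context passages
    response: str              # answer generated by the LLM
    relevance_score: float     # fraction of the context relevant to the query
    utilization_score: float   # fraction of the context used by the generator
    completeness_score: float  # coverage of relevant context in the response
    adherence: bool            # whether the response is grounded in the context
```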
The TRACe evaluation framework measures the quality of both the retriever and the response generator in a RAG system. It comprises four metrics: utilization, relevance, adherence, and completeness. Relevance measures the fraction of the retrieved context that is pertinent to the query; utilization measures the fraction of the retrieved context that the generator actually uses; adherence measures the extent to which the response is grounded in the context; and completeness measures how well the response incorporates all of the relevant information in the context.
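As a rough illustration of these definitions, the sketch below computes the four metrics from hypothetical token-level annotations, assuming each context token is labeled as relevant and/or utilized and each response sentence is labeled as supported by the context. The actual annotation granularity used in RAGBench may differ.

```python
from typing import Dict, List

def trace_metrics(relevant: List[bool], utilized: List[bool],
                  response_supported: List[bool]) -> Dict[str, float]:
    """Compute illustrative TRACe-style scores from per-token / per-sentence labels."""
    n = len(relevant)                # number of context tokens
    n_relevant = sum(relevant)
    n_utilized = sum(utilized)
    n_relevant_and_utilized = sum(r and u for r, u in zip(relevant, utilized))

    return {
        # fraction of retrieved context relevant to the query
        "relevance": n_relevant / n if n else 0.0,
        # fraction of retrieved context used by the generator
        "utilization": n_utilized / n if n else 0.0,
        # how much of the relevant context made it into the response
        "completeness": n_relevant_and_utilized / n_relevant if n_relevant else 0.0,
        # adherent only if every response sentence is grounded in the context
        "adherence": float(all(response_supported)) if response_supported else 0.0,
    }

# Example: 4 context tokens, 2 relevant, 2 utilized, fully supported response.
print(trace_metrics(
    relevant=[True, True, False, False],
    utilized=[True, False, True, False],
    response_supported=[True, True],
))
```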
RAGBench is evaluated with a variety of LLM-based evaluators, including a zero-shot GPT-3.5 judge, RAGAS, and TruLens. The results show that while few-shot LLM judges perform consistently across domains and task types, they still underperform a fine-tuned DeBERTa-large model. The study also highlights the value of pairing RAGBench with TRACe for advancing the state of RAG evaluation. Together, the dataset and evaluation framework provide a standardized benchmark that yields more precise and actionable insights into the strengths and weaknesses of different RAG systems.
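For comparing evaluators against the benchmark, a simple scoring sketch is shown below. It assumes RMSE for the continuous relevance and utilization targets and AUROC for the binary adherence labels; these metric choices and the function names are illustrative assumptions about the setup.

```python
from sklearn.metrics import mean_squared_error, roc_auc_score

def score_evaluator(pred_relevance, true_relevance, pred_adherence, true_adherence):
    """Score one evaluator against benchmark annotations (metric choices assumed)."""
    rmse = mean_squared_error(true_relevance, pred_relevance) ** 0.5
    auroc = roc_auc_score(true_adherence, pred_adherence)
    return {"relevance_rmse": rmse, "adherence_auroc": auroc}

# Dummy predictions from a hypothetical judge, for illustration only.
print(score_evaluator(
    pred_relevance=[0.8, 0.4, 0.9], true_relevance=[0.7, 0.5, 1.0],
    pred_adherence=[0.9, 0.2, 0.7], true_adherence=[1, 0, 1],
))
```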