Evaluating Retrieval Quality in Retrieval-Augmented Generation

July 14-18, 2024 | Alireza Salemi, Hamed Zamani
This paper introduces eRAG, a novel method for evaluating retrieval models within retrieval-augmented generation (RAG) systems. Evaluating retrieval end-to-end through the full RAG pipeline is computationally expensive, while traditional retrieval metrics based on relevance judgments show limited correlation with downstream RAG performance. eRAG instead evaluates each document in the retrieval list individually using the large language model (LLM) within the RAG system: the output generated for each document is scored against the downstream task's ground-truth labels. This yields document-level annotations based on various downstream task metrics, which are then aggregated using set-based or ranking metrics to obtain a single evaluation score for each retrieval result list.

The proposed method achieves a higher correlation with downstream RAG performance than baseline methods, with improvements in Kendall's τ correlation ranging from 0.168 to 0.494. eRAG also offers significant computational advantages, improving runtime and consuming up to 50 times less GPU memory than end-to-end evaluation. The method is evaluated on a wide range of datasets, including Natural Questions (NQ), TriviaQA, HotpotQA, FEVER, and Wizard of Wikipedia (WoW), demonstrating its effectiveness across different tasks.

The paper also examines how the retrieval augmentation method, the number of retrieved documents, and the LLM size affect correlation. Results show that eRAG consistently outperforms other evaluation approaches, particularly when used with the Fusion-in-Decoder (FiD) method, and remains more efficient than end-to-end evaluation in both memory consumption and inference time, making it a promising approach for evaluating retrieval models in RAG systems. The implementation of eRAG is publicly available for research purposes.
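The evaluation loop described above is straightforward to prototype. The sketch below is a minimal illustration, not the authors' released implementation: `generate_answer` and `downstream_metric` are hypothetical placeholders standing in for the RAG system's own LLM call and the downstream task metric (e.g., exact match for NQ), and the aggregation shown is just one possible choice of set-based and ranking metrics.

```python
# Minimal sketch of an eRAG-style evaluation loop (illustrative only).
# `generate_answer` and `downstream_metric` are assumed callables supplied
# by the user, not functions from the paper's released code.

from typing import Callable, Dict, List


def erag_score(
    query: str,
    retrieved_docs: List[str],
    ground_truth: str,
    generate_answer: Callable[[str, str], str],      # LLM output given (query, single doc)
    downstream_metric: Callable[[str, str], float],  # e.g., exact match or token F1
) -> Dict[str, object]:
    """Score one retrieval result list by feeding each document to the LLM
    individually and judging each output against the downstream labels."""
    # 1. Document-level "relevance" labels from the downstream task metric.
    doc_scores = [
        downstream_metric(generate_answer(query, doc), ground_truth)
        for doc in retrieved_docs
    ]

    # 2. Aggregate the per-document labels into a single retrieval score,
    #    e.g., a set-based metric (precision@k) and a ranking metric (MRR).
    k = len(doc_scores)
    precision_at_k = sum(doc_scores) / k if k else 0.0

    reciprocal_rank = 0.0
    for rank, score in enumerate(doc_scores, start=1):
        if score > 0:  # first document whose output matches the ground truth
            reciprocal_rank = 1.0 / rank
            break

    return {
        "precision@k": precision_at_k,
        "mrr": reciprocal_rank,
        "per_doc": doc_scores,
    }
```

In practice, these per-query scores would be averaged over the evaluation set and then correlated (e.g., via Kendall's τ) with end-to-end RAG performance when comparing retrieval models.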