July 14–18, 2024, Washington, DC, USA | Alireza Salemi, Hamed Zamani
The paper "Evaluating Retrieval Quality in Retrieval-Augmented Generation" by Alireza Salemi and Hamed Zamani addresses two challenges in evaluating retrieval-augmented generation (RAG) systems: end-to-end evaluation is computationally expensive, and traditional evaluation methods correlate poorly with downstream performance. The authors propose an evaluation approach called eRAG, in which the large language model (LLM) of the RAG system processes each document in the retrieval list individually, and the output generated from each document is scored against the downstream task's ground truth labels to produce a document-level annotation. These annotations then serve as the relevance labels of each document with respect to the query.
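The per-document procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `erag_document_labels`, `precision_at_k`, `toy_generate`, and `exact_match` are hypothetical, the "LLM" is a keyword-matching stand-in, and precision at k is just one possible way to aggregate the document-level labels into a single score.

```python
from typing import Callable, List


def erag_document_labels(
    query: str,
    documents: List[str],
    generate: Callable[[str, List[str]], str],
    score: Callable[[str, str], float],
    ground_truth: str,
) -> List[float]:
    """Label each retrieved document by the downstream score it yields alone."""
    labels = []
    for doc in documents:
        # The generator sees only this single document, not the full list.
        output = generate(query, [doc])
        # The downstream metric on that output becomes the document's label.
        labels.append(score(output, ground_truth))
    return labels


def precision_at_k(labels: List[float], k: int) -> float:
    """Aggregate document-level labels over the top-k ranked documents."""
    top = labels[:k]
    return sum(top) / len(top) if top else 0.0


# Hypothetical stand-in for an LLM: answers "Paris" only if the document says so.
def toy_generate(query: str, docs: List[str]) -> str:
    return "Paris" if any("Paris" in d for d in docs) else "unknown"


# Hypothetical downstream metric: exact match against the ground truth.
def exact_match(output: str, truth: str) -> float:
    return 1.0 if output.strip().lower() == truth.strip().lower() else 0.0


retrieved = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Eiffel Tower is in Paris.",
]
labels = erag_document_labels(
    "What is the capital of France?", retrieved, toy_generate, exact_match, "Paris"
)
```

In this toy run the first and third documents lead the stand-in generator to the correct answer, so they receive a label of 1.0 and the second receives 0.0. A real setup would replace `toy_generate` with the actual LLM of the RAG system and `exact_match` with the metric of the downstream task.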
The authors report that eRAG correlates more strongly with downstream RAG performance than baseline methods, with improvements in Kendall's tau correlation ranging from 0.168 to 0.494. eRAG also offers significant computational advantages, improving runtime and consuming up to 50 times less GPU memory than end-to-end evaluation. Experiments on a range of datasets show that eRAG consistently outperforms other evaluation methods in both correlation and efficiency.
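To make the correlation claim concrete, the sketch below computes Kendall's tau between a retrieval evaluation score and the downstream RAG score over a set of queries. It is a plain-Python illustration of the tau-b variant (which corrects for ties), not the paper's evaluation code; the function name `kendall_tau_b` and the numbers in the example are made up for illustration.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equally long score lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    # Compare every unordered pair of observations.
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1  # both scores order the pair the same way
        elif dx * dy < 0:
            discordant += 1  # the two scores disagree on the order
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Made-up per-query scores: retrieval evaluation vs. end-to-end performance.
retrieval_scores = [1, 2, 3, 4]
downstream_scores = [1, 2, 4, 3]
tau = kendall_tau_b(retrieval_scores, downstream_scores)
```

A value near 1 means the retrieval evaluation ranks queries almost exactly as the downstream metric does, a value near 0 means the two are unrelated, and a value near -1 means they disagree. In the example, five of the six pairs are ordered consistently and one is not, which gives a tau of 2/3.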