Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation


2024 | Gauthier Guinet, Behrooz Omidvar-Tehrani, Anoop Deoras, Laurent Callot
This paper proposes an automated evaluation method for Retrieval-Augmented Language Models (RAG) based on task-specific synthetic exams. The method generates multiple-choice questions from the document corpus associated with a task, enabling automated, cost-efficient, and interpretable evaluation of RAG systems. Item Response Theory (IRT) is used to estimate exam quality and informativeness, allowing the exam to be refined iteratively by eliminating low-informative questions. The approach is demonstrated on four open-ended question-answering tasks built from ArXiv abstracts, StackExchange questions, AWS DevOps troubleshooting guides, and SEC filings. The results show that optimizing the retrieval mechanism often yields larger performance gains than simply increasing model size.

The paper provides a comprehensive framework for evaluating RAG pipelines, including benchmark datasets and an open-source implementation, and analyzes the impact of factors such as model size, retrieval mechanism, prompting, and fine-tuning on RAG performance. Because the evaluation is both predictive and prescriptive, it supports continuous, feedback-driven improvement of the exam corpus. Overall, the work contributes a scalable, interpretable, and robust approach to the automated assessment of RAG systems.
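The summary does not spell out the paper's exact IRT parameterization or pruning criterion, so the following is only a minimal sketch of the general idea: score each multiple-choice item with a standard three-parameter logistic (3PL) IRT model, compute its Fisher information across a range of model abilities, and drop the least informative items. The function names (`item_information`, `prune_exam`) and the `keep_fraction` threshold are hypothetical choices for illustration, not the paper's implementation.

```python
import numpy as np

def p_correct(theta, a, b, c):
    """3PL IRT model: probability that a model with ability theta answers the item correctly.
    a = discrimination, b = difficulty, c = guessing floor (chance of a correct random guess)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of one item at ability theta (standard 3PL formula)."""
    p = p_correct(theta, a, b, c)
    q = 1.0 - p
    return (a ** 2) * ((p - c) ** 2 / (1.0 - c) ** 2) * (q / p)

def prune_exam(items, ability_grid, keep_fraction=0.75):
    """Rank items by average information over an ability grid and keep the top fraction.
    `items` is a list of (a, b, c) tuples; keep_fraction is a hypothetical cutoff."""
    scores = [np.mean([item_information(t, a, b, c) for t in ability_grid])
              for a, b, c in items]
    cutoff = np.quantile(scores, 1.0 - keep_fraction)
    return [item for item, s in zip(items, scores) if s >= cutoff]

# Example: four multiple-choice items with a 25% guessing floor (4 answer options).
exam = [(1.2, -0.5, 0.25), (0.3, 0.0, 0.25), (2.0, 1.0, 0.25), (0.8, 0.2, 0.25)]
grid = np.linspace(-3, 3, 61)
print(prune_exam(exam, grid))  # low-discrimination items tend to be dropped
```

In this sketch, items with low discrimination contribute little information at any ability level, so they are the first to be removed; the same filtering step can be repeated after each round of exam generation to keep the exam maximally informative.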