9 Apr 2024 | Negar Arabzadeh, Charles L. A. Clarke
This paper explores methods for evaluating generative information retrieval (Gen-IR) systems, which generate responses that are not drawn from a fixed collection of documents. Traditional IR evaluation methods, which rely on human assessors, are inadequate for Gen-IR systems. The paper focuses on methods that can operate autonomously and support human auditing. Five evaluation methods are compared: binary relevance, graded relevance, subtopic relevance, pairwise preferences, and embeddings. These methods are validated using the TREC Deep Learning Track datasets and applied to evaluate the output of several generative systems. The results show that subtopic relevance provides a good balance between autonomy and auditability, while pairwise preferences offer the best overall performance in recognizing differences between generative models.
The paper also discusses limitations and future work, including the need for more extensive testing and the exploration of personalized and diverse responses in Gen-IR systems.
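To make the embedding-based method named in the summary more concrete, the sketch below shows one plausible way such a score could be computed: the cosine similarity between a vector for a generated response and vectors for passages already judged relevant. This is an illustration only, not the authors' implementation. The function names `cosine_similarity` and `embedding_score`, the choice of taking the maximum over references, and the toy vectors are all assumptions; in practice the vectors would come from an embedding model that the summary does not specify.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def embedding_score(
    response_vec: Sequence[float],
    reference_vecs: Sequence[Sequence[float]],
) -> float:
    """Hypothetical score: best similarity to any known relevant passage.

    Taking the maximum is an assumption made for this sketch; averaging
    over the references would be an equally reasonable choice.
    """
    if not reference_vecs:
        raise ValueError("at least one reference vector is required")
    return max(cosine_similarity(response_vec, ref) for ref in reference_vecs)


# Toy vectors standing in for real embeddings (assumed values).
response = [1.0, 0.0, 1.0]
references = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
score = embedding_score(response, references)  # close to 1.0 for this toy data
```

A score like this needs no human assessor at evaluation time, which fits the autonomy goal described above, but a single number gives an auditor little to inspect compared with explicit subtopic or preference judgments.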