March 2024 | Negar Arabzadeh, Charles L. A. Clarke
This paper presents a comparison of methods for evaluating generative information retrieval (Gen-IR) systems. Traditional information retrieval (IR) evaluation methods are not suitable for Gen-IR systems, which generate responses rather than drawing them from a fixed corpus. The paper explores several methods for evaluating Gen-IR systems: binary relevance, graded relevance, subtopic relevance, pairwise preferences, and embeddings. These methods are validated against human assessments on several TREC Deep Learning Track tasks and applied to evaluate the output of several purely generative systems.
The paper also sets out requirements for these methods: agreement with human assessment, suitability for auditing by human assessors, and the ability to operate autonomously. It concludes that subtopic relevance offers a reasonable compromise between autonomy and auditability, while pairwise preferences provide the best overall performance. Limitations of the work include the focus on a limited set of TREC tracks and the reliance on commercial LLMs. As future work, the paper suggests exploring additional test collections and Gen-IR systems, and conducting human studies to validate the findings.
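As a concrete illustration of the embedding-based approach, the sketch below scores a generated response by its cosine similarity to known-relevant passages in embedding space. This is a minimal sketch under assumptions not taken from the paper: the embedding model, the aggregation over passages (maximum similarity here), and the relevance threshold are all illustrative placeholders.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def embedding_relevance(response_emb, relevant_embs, threshold=0.7):
    """Judge a generated response relevant if it lies close enough,
    in embedding space, to any known-relevant passage.

    The max-similarity aggregation and the 0.7 threshold are
    illustrative assumptions, not values from the paper.
    """
    best = max(cosine_similarity(response_emb, e) for e in relevant_embs)
    return best >= threshold

# Toy vectors standing in for real model embeddings.
response = [1.0, 0.0]
relevant_passages = [[0.9, 0.1], [0.0, 1.0]]
print(embedding_relevance(response, relevant_passages))
```

In practice the vectors would come from a trained text-embedding model, and the threshold would be tuned against human judgments, which is exactly the kind of validation the paper performs.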