8 May 2024 | Shivani Upadhyay, Ehsan Kamalloo, Jimmy Lin
The paper "LLMs Can Patch Up Missing Relevance Judgments in Evaluation" by Shivani Upadhyay, Ehsan Kamalloo, Jimmy Lin, and others from the David R. Cheriton School of Computer Science at the University of Waterloo, addresses the issue of incomplete relevance judgments in information retrieval (IR) benchmarks. These missing judgments can introduce biases into evaluation metrics, such as nDCG@k, MAP, and Pr@k, which often treat unjudged documents as non-relevant. The authors propose using large language models (LLMs) to automatically label unjudged documents, aiming to ensure more reliable and accurate evaluations.
The study simulates scenarios with varying degrees of unjudged documents (holes) by randomly removing relevant documents from the judgments of TREC Deep Learning (DL) tracks. The experiments reveal a strong correlation between evaluation results produced with the LLM-based method and those produced with the ground-truth relevance judgments: on three TREC DL datasets, the Kendall τ correlation values for Vicuna-7B and GPT-3.5 Turbo are 0.87 and 0.92, respectively, even when only 10% of the judgments are retained.
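The hole simulation and the rank-correlation check could be sketched as follows. The {query_id: {doc_id: grade}} qrels layout, the keep-at-least-one-relevant-document rule, and the toy per-system scores are assumptions for illustration rather than the paper's exact protocol.

```python
# Sketch of the hole simulation and the Kendall tau comparison, assuming qrels
# stored as {query_id: {doc_id: grade}}; the keep-at-least-one rule and the toy
# per-system scores are illustrative choices, not the paper's exact protocol.
import random

from scipy.stats import kendalltau


def drop_relevant_judgments(qrels: dict, keep_fraction: float, seed: int = 42) -> dict:
    """Randomly retain only `keep_fraction` of the relevant (grade > 0) judgments,
    creating holes; non-relevant judgments are left untouched."""
    rng = random.Random(seed)
    holey = {}
    for qid, docs in qrels.items():
        relevant = [d for d, g in docs.items() if g > 0]
        if not relevant:
            holey[qid] = dict(docs)
            continue
        kept = set(rng.sample(relevant, max(1, int(len(relevant) * keep_fraction))))
        holey[qid] = {d: g for d, g in docs.items() if g <= 0 or d in kept}
    return holey


# Rank correlation between system orderings, e.g. nDCG@10 per run computed with
# the original qrels versus qrels whose holes were patched by an LLM.
scores_original = {"runA": 0.71, "runB": 0.65, "runC": 0.58}  # toy numbers
scores_patched = {"runA": 0.69, "runB": 0.66, "runC": 0.57}   # toy numbers
runs = sorted(scores_original)
tau, _ = kendalltau([scores_original[r] for r in runs],
                    [scores_patched[r] for r in runs])
print(f"Kendall tau = {tau:.2f}")
```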
The authors highlight the limitations of traditional pooling methods, which can introduce artifacts and biases, and emphasize the need for automated solutions. They demonstrate that LLMs can effectively fill in the holes, providing a robust framework for evaluating retrieval systems. The framework is designed to be user-friendly and will be made available to the research community to facilitate more accurate and efficient evaluation of IR models.
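As a rough picture of how patched judgments could slot into standard trec_eval-style scoring, the sketch below reuses the hypothetical judge_with_llm helper from above and assumes pytrec_eval plus qrels/run/queries/corpus dictionaries loaded elsewhere; it is not the authors' released framework.

```python
# Rough sketch of plugging LLM-patched judgments into standard evaluation,
# reusing the hypothetical judge_with_llm helper above; pytrec_eval and the
# qrels/run/queries/corpus dictionaries are assumed, not the authors' code.
import pytrec_eval


def patch_qrels(qrels: dict, run: dict, queries: dict, corpus: dict, depth: int = 10) -> dict:
    """Label every top-`depth` retrieved document that has no judgment."""
    patched = {qid: dict(docs) for qid, docs in qrels.items()}
    for qid, doc_scores in run.items():
        top_docs = sorted(doc_scores, key=doc_scores.get, reverse=True)[:depth]
        for did in top_docs:
            if did not in patched.setdefault(qid, {}):
                patched[qid][did] = judge_with_llm(queries[qid], corpus[did])
    return patched


# Usage sketch (qrels, run, queries, corpus assumed loaded elsewhere):
#   patched = patch_qrels(qrels, run, queries, corpus)
#   evaluator = pytrec_eval.RelevanceEvaluator(patched, {"ndcg_cut_10"})
#   per_query = evaluator.evaluate(run)   # holes no longer default to non-relevant
```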