LLMs Can Patch Up Missing Relevance Judgments in Evaluation


8 May 2024 | Shivani Upadhyay, Ehsan Kamalloo, Jimmy Lin (David R. Cheriton School of Computer Science, University of Waterloo)
This paper presents a method for automatically labeling unjudged documents in information retrieval (IR) benchmarks using large language models (LLMs). Missing relevance judgments, or "holes," are a long-standing problem in IR test collections: retrieved documents that were never judged are conventionally treated as non-relevant, which can bias evaluation against systems that surface them. The authors propose a framework that uses LLMs to fill in these missing judgments, improving the accuracy and reliability of IR evaluation.

To study the effect of holes systematically, the authors simulate scenarios with varying degrees of missing judgments by randomly removing relevant documents from the relevance judgments of the TREC Deep Learning (DL) tracks. The LLMs are then instructed to assign fine-grained relevance labels to the unjudged documents, guided by detailed instructions and carefully crafted examples. Evaluated on three TREC DL datasets, the LLM-based approach shows strong agreement with the ground-truth relevance judgments: even in the extreme scenario where only 10% of the judgments are retained, it achieves average Kendall τ correlations of 0.87 for Vicuña-7B and 0.92 for GPT-3.5 Turbo. These results indicate that LLMs of various sizes can act as TREC assessors, correlating strongly with gold relevance judgments. The authors hope their LLM-based evaluation framework lays the groundwork for a fully automated and robust relevance judgment process that eliminates the biases arising from holes in the data.
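The pipeline has two moving parts: degrade the qrels to create holes, then ask an LLM to grade every retrieved-but-unjudged document. The Python sketch below illustrates both steps under our own assumptions; the function names, the qrels/run data layout, and the `grade_fn` callable that wraps the LLM assessor (e.g. GPT-3.5 Turbo or Vicuña-7B) are illustrative stand-ins, not the authors' actual implementation or prompts.

```python
import random
from typing import Callable, Dict, List

Qrels = Dict[str, Dict[str, int]]   # query id -> {doc id -> graded relevance label}
Run = Dict[str, List[str]]          # query id -> ranked list of retrieved doc ids


def simulate_holes(qrels: Qrels, keep_fraction: float, seed: int = 0) -> Qrels:
    """Randomly retain only `keep_fraction` of the relevant documents per query,
    leaving non-relevant judgments untouched (hole-simulation sketch)."""
    rng = random.Random(seed)
    degraded: Qrels = {}
    for qid, judgments in qrels.items():
        relevant = [d for d, label in judgments.items() if label > 0]
        n_keep = max(1, round(len(relevant) * keep_fraction)) if relevant else 0
        kept = set(rng.sample(relevant, n_keep))
        degraded[qid] = {d: label for d, label in judgments.items()
                         if label <= 0 or d in kept}
    return degraded


def patch_holes(qrels: Qrels, run: Run, queries: Dict[str, str],
                corpus: Dict[str, str],
                grade_fn: Callable[[str, str], int]) -> Qrels:
    """Fill every hole (a retrieved document absent from the qrels) with an LLM label.
    `grade_fn(query_text, doc_text)` is a hypothetical wrapper around the LLM
    assessor that returns a graded label in {0, 1, 2, 3}, as in TREC DL."""
    patched = {qid: dict(j) for qid, j in qrels.items()}
    for qid, ranking in run.items():
        for docid in ranking:
            if docid not in patched.setdefault(qid, {}):
                patched[qid][docid] = grade_fn(queries[qid], corpus[docid])
    return patched
```

In the paper, the grading step is driven by detailed instructions and few-shot examples in the prompt; `grade_fn` here stands in for that entire prompt-and-parse loop.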
The key contributions are (1) an LLM-based framework for filling in the holes in IR test collections and (2) an extensive evaluation of the LLM assessor under varying degrees of holes across several TREC DL datasets. The framework is intended as an easy-to-use tool that lets researchers and practitioners measure the effectiveness of retrieval systems without holes in the relevance judgments.
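The evaluation compares how systems rank under the original judgments versus the patched ones, summarized by Kendall τ. A minimal sketch of that comparison, assuming per-system metric scores (e.g. nDCG@10) have already been computed under each qrels variant; the system names and scores below are made up for illustration:

```python
from scipy.stats import kendalltau


def ranking_correlation(scores_full: dict, scores_patched: dict) -> float:
    """Kendall tau between the system orderings induced by two qrels variants.
    Both inputs map system name -> metric score (e.g. nDCG@10)."""
    systems = sorted(scores_full)
    full = [scores_full[s] for s in systems]
    patched = [scores_patched[s] for s in systems]
    tau, _p_value = kendalltau(full, patched)
    return tau


# A tau near 1.0 means the patched judgments rank the systems
# almost identically to the full judgments.
print(ranking_correlation(
    {"bm25": 0.48, "dense": 0.62, "hybrid": 0.66},
    {"bm25": 0.50, "dense": 0.61, "hybrid": 0.67},
))
```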