Can We Use Large Language Models to Fill Relevance Judgment Holes?

Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi, Mohammad Aliannejadi
The authors investigate whether Large Language Models (LLMs) can be used to fill "holes" in the relevance judgments of test collections, with a focus on conversational search. Incomplete relevance judgments limit the reusability of a test collection: when new systems are compared against previously pooled runs, they are disadvantaged because many of the documents they retrieve were never assessed (the holes). The authors explore using LLMs to generate relevance judgments, guided by the existing human judgments, with the aim of improving the consistency and quality of the collection.

The study uses the TREC iKAT dataset, a benchmark for conversational search. The authors compare different LLMs (e.g., ChatGPT and LLaMA) as relevance assessors and evaluate how well the generated labels align with human annotations. They find that LLM judgments can be highly correlated with human judgments, but the correlation is lower when human and automatic judgments are combined. Performance also varies with the model and the type of prompt used.
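To make the judging step concrete, below is a minimal sketch of how an LLM can be prompted for a graded relevance label in a conversational setting. The prompt wording, the 0-4 grading scale, and the `call_llm` callable are illustrative assumptions, not the authors' exact prompt or code.

```python
# Minimal sketch of LLM-based relevance assessment. `call_llm(prompt) -> str`
# stands in for whatever LLM client is used (e.g., ChatGPT or LLaMA); the
# prompt text and 0-4 scale below are assumptions for illustration.
from typing import Callable

def build_judgment_prompt(query: str, context: str, passage: str) -> str:
    """Assemble a zero-shot grading prompt for one (query, passage) pair."""
    return (
        "You are a relevance assessor for conversational search.\n"
        f"Conversation so far:\n{context}\n\n"
        f"Current user question: {query}\n\n"
        f"Candidate passage:\n{passage}\n\n"
        "Rate how relevant the passage is to the current question on a scale "
        "from 0 (not relevant) to 4 (fully relevant). Answer with a single digit."
    )

def judge_passage(call_llm: Callable[[str], str],
                  query: str, context: str, passage: str) -> int:
    """Ask the LLM for a graded label; fall back to 0 if the reply is unparsable."""
    answer = call_llm(build_judgment_prompt(query, context, passage)).strip()
    return int(answer[0]) if answer[:1].isdigit() else 0
```

In practice such a function would be run over every unjudged (topic, passage) pair in the pool, with the conversation history serialized into `context`.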
The authors also investigate how LLM-generated judgments affect the ranking of retrieval systems. They find that filling holes with LLM labels can produce rankings that are more consistent with those obtained from human labels, but the effectiveness again depends on the model and the type of prompt.

They conclude that generating LLM annotations for the entire document pool is more effective than only filling the holes: it yields higher rank correlation and ensures that the same labeling biases are applied to every system. Overall, the study highlights the potential of LLMs for improving the quality and reusability of test collections for conversational search, while also pointing out the challenges of relying on LLM relevance judgments, such as systematic biases and the need for further research on prompt engineering and fine-tuning to align LLMs with human assessors.
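To illustrate the difference between filling the holes and re-judging the whole pool, the sketch below builds both qrels variants and compares the system rankings they induce using Kendall's tau, a standard rank-correlation measure for this kind of analysis. The dictionary layout and the assumed evaluation step are hypothetical and not taken from the paper.

```python
# Sketch: two ways of combining human and LLM labels, and how to compare the
# system rankings they produce. `human_qrels` / `llm_qrels` are assumed to map
# (topic_id, doc_id) -> graded label, and some evaluate(qrels) step (not shown)
# is assumed to return {system_name: effectiveness_score} for the runs.
from scipy.stats import kendalltau

def fill_holes(human_qrels: dict, llm_qrels: dict) -> dict:
    """Keep every human label; use LLM labels only for unjudged documents."""
    return {**llm_qrels, **human_qrels}

def full_pool(llm_qrels: dict) -> dict:
    """Re-label the entire pool with the LLM so all runs share the same bias."""
    return dict(llm_qrels)

def ranking_agreement(scores_a: dict, scores_b: dict) -> float:
    """Kendall's tau between the system orderings induced by two qrels sets."""
    systems = sorted(scores_a)
    tau, _ = kendalltau([scores_a[s] for s in systems],
                        [scores_b[s] for s in systems])
    return tau
```

Under the paper's findings, the full-pool variant tends to agree more closely with rankings from purely human qrels, because every system is scored under the same labeling bias.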