9 May 2024 | Zahra Abbasiantaeb, Chuan Meng, Leif Azzopardi, Mohammad Aliannejadi
This paper explores the use of Large Language Models (LLMs) to fill relevance judgment "holes" in test collections, particularly in the context of Conversational Search using the TREC iKAT dataset. The authors investigate how LLMs can be used to generate relevance judgments, which are then used to evaluate new systems. They compare different LLMs, including commercial and open-source models, in zero-shot, one-shot, and fine-tuning settings. The study finds that while LLMs can produce system rankings that correlate highly with those from human judgments, agreement with humans at the binary and graded relevance levels is lower. Fine-tuning LLMs can improve the alignment between LLM-generated and human judgments, but this does not necessarily lead to higher ranking correlations. The authors also find that the effect of LLM-generated labels on the ranking of new models grows with the size of the holes. They conclude that generating LLM annotations for the entire document pool is more effective than filling holes, as it yields consistent rankings while remaining aligned with human annotations. Future work should focus on prompt engineering and fine-tuning LLMs to better reflect human annotations.
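To make the evaluation setup concrete, here is a minimal Python sketch (not the authors' code) of the hole-filling idea: human labels are kept where they exist, LLM labels fill the gaps, and Kendall's tau measures how much the induced system ranking changes. The qrels, runs, and the simple P@k metric below are hypothetical stand-ins for the TREC iKAT judgments and the paper's actual metrics.

```python
# Sketch of filling relevance-judgment "holes" with LLM labels and checking
# how the system ranking shifts. Toy data; real labels would come from the
# TREC iKAT pool and an LLM judge.
from scipy.stats import kendalltau

# Human qrels with holes: {query: {doc_id: graded relevance}}
human_qrels = {"q1": {"d1": 2, "d2": 0}, "q2": {"d1": 1}}
# Hypothetical LLM-generated labels covering the full pool
llm_qrels = {"q1": {"d1": 2, "d2": 1, "d3": 0}, "q2": {"d1": 0, "d2": 2}}

def fill_holes(human, llm):
    """Keep the human label when it exists; fall back to the LLM label."""
    return {
        q: {d: human.get(q, {}).get(d, rel) for d, rel in docs.items()}
        for q, docs in llm.items()
    }

def precision_at_k(run, qrels, k=2):
    """Average P@k over queries, counting graded relevance > 0 as relevant."""
    scores = []
    for q, ranking in run.items():
        rels = qrels.get(q, {})
        scores.append(sum(1 for d in ranking[:k] if rels.get(d, 0) > 0) / k)
    return sum(scores) / len(scores)

# Two toy systems: {query: ranked list of doc_ids}
runs = {
    "sysA": {"q1": ["d1", "d3"], "q2": ["d2", "d1"]},
    "sysB": {"q1": ["d2", "d3"], "q2": ["d1", "d2"]},
}

filled_qrels = fill_holes(human_qrels, llm_qrels)
human_scores = [precision_at_k(r, human_qrels) for r in runs.values()]
filled_scores = [precision_at_k(r, filled_qrels) for r in runs.values()]

# Correlation between the system rankings induced by the two label sets
tau, _ = kendalltau(human_scores, filled_scores)
print(f"human-only scores:  {human_scores}")
print(f"hole-filled scores: {filled_scores}")
print(f"Kendall's tau between system rankings: {tau:.2f}")
```

With more systems and larger holes, this is the kind of comparison the paper runs at scale: a high tau means the LLM-filled qrels preserve the relative ordering of systems even when the per-document label agreement with humans is modest.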