Query Performance Prediction using Relevance Judgments Generated by Large Language Models

Query Performance Prediction using Relevance Judgments Generated by Large Language Models

June 2024 | CHUAN MENG, NEGAR ARABZADEH, ARIAN ASKARI, MOHAMMAD ALIANNEJADI, MAARTEN DE RIJKE
Query performance prediction (QPP) aims to estimate the retrieval quality of a search system for a query without human relevance judgments. Previous QPP methods typically return a single scalar value, which cannot distinguish between different IR evaluation measures and offers little interpretability. To address these issues, the authors propose QPP-GenRE, a framework that decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list. Any IR evaluation measure can then be predicted using the generated relevance judgments as pseudo-labels, which also enables interpretation and error tracking. The framework uses open-source large language models (LLMs) to generate relevance judgments, ensuring scientific reproducibility.

Two main challenges are addressed: (i) the high computational cost of judging an entire corpus when predicting a measure that considers recall, and (ii) the limited performance of prompting LLMs in a zero-/few-shot manner. For the first, an approximation strategy predicts measures considering recall by judging only a few items at the top of the ranked list; for the second, LLMs are fine-tuned on human-labeled relevance judgments to improve their ability to generate relevance judgments.

Experiments on the TREC 2019–2022 deep learning tracks show that QPP-GenRE achieves state-of-the-art QPP quality for both lexical and neural rankers, outperforming existing baselines in correlation coefficients for a precision-oriented measure (RR@10) and a measure considering recall (nDCG@10). QPP-GenRE also predicts IR evaluation measures effectively at shallow judging depths, and fine-tuning LLMs improves both relevance judgment generation and QPP quality. The framework is compatible with various relevance prediction methods and integrates with a state-of-the-art pointwise LLM-based re-ranker, RankLLaMA, to achieve high QPP quality.
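The core idea, computing an IR evaluation measure directly from per-item judgments generated by an LLM, can be sketched as follows. This is a minimal illustration, not the authors' implementation; it assumes the LLM has already judged the top-ranked items and returned binary pseudo-labels, and it approximates the ideal DCG from only the judged prefix of the ranked list, in the spirit of the paper's shallow-judging approximation strategy.

```python
import math

def rr_at_k(judgments, k=10):
    """Reciprocal Rank@k from binary pseudo-labels (1 = relevant)."""
    for rank, rel in enumerate(judgments[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def approx_ndcg_at_k(judgments, k=10):
    """nDCG@k approximated from judgments of the top-ranked items only.

    The exact ideal DCG requires relevance judgments for the whole corpus;
    here it is approximated using only the judged items, so the result is
    an estimate rather than the true nDCG@k.
    """
    dcg = sum(rel / math.log2(r + 1)
              for r, rel in enumerate(judgments[:k], start=1))
    ideal = sorted(judgments, reverse=True)
    idcg = sum(rel / math.log2(r + 1)
               for r, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the LLM judged the top-5 items; relevant items at ranks 2 and 4.
judgments = [0, 1, 0, 1, 0]
print(rr_at_k(judgments))         # -> 0.5
print(approx_ndcg_at_k(judgments))
```

Because the predicted score is assembled from per-item judgments, an analyst can trace a poor prediction back to individual (mis)judged items, which is the interpretability benefit the paper emphasizes.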
QPP-GenRE is efficient and has lower latency than some supervised QPP baselines, and it shows significant improvements in QPP quality over previous methods. The framework is suitable for knowledge-intensive professional search scenarios and can be used to analyze search system performance in offline settings. The authors also provide a reproducible implementation, including data, scripts, and fine-tuned checkpoints for LLaMA-7B, Llama-3-8B, and Llama-3-8B-Instruct.
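To make the pointwise judging step concrete, the sketch below shows how a per-item relevance judgment might be obtained from an open-source LLM. The prompt template and the Yes/No parsing rule are hypothetical, invented here for illustration; the paper's actual prompt wording and label mapping may differ.

```python
def build_judgment_prompt(query: str, passage: str) -> str:
    """Build a pointwise relevance-judgment prompt.

    Hypothetical template; QPP-GenRE's actual prompt may be worded
    differently, and fine-tuned models are trained on such pairs.
    """
    return (
        "Judge whether the passage answers the query. "
        "Answer 'Yes' or 'No' only.\n"
        f"Query: {query}\n"
        f"Passage: {passage}\n"
        "Answer:"
    )

def parse_judgment(llm_output: str) -> int:
    """Map the model's free-text answer to a binary pseudo-label."""
    return 1 if llm_output.strip().lower().startswith("yes") else 0

# Example: one (query, passage) pair produces one independent subtask.
prompt = build_judgment_prompt(
    "what is query performance prediction",
    "QPP estimates the retrieval quality of a search system for a query.",
)
label = parse_judgment("Yes, the passage answers the query.")  # -> 1
```

Running this prompt once per item in the ranked list yields the pseudo-label list that the evaluation measures are computed from; because each judgment is an independent subtask, the calls can be batched or parallelized.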