Query Performance Prediction using Relevance Judgments Generated by Large Language Models

Query Performance Prediction using Relevance Judgments Generated by Large Language Models

June 2024 | CHUAN MENG, NEGAR ARABZADEH, ARIAN ASKARI, MOHAMMAD ALIANNEJADI, MAARTEN DE RIJKE
Query performance prediction (QPP) aims to estimate the retrieval quality of a search system for a query without human relevance judgments. Previous QPP methods typically return a single scalar value, which cannot distinguish between different IR evaluation measures and offers little interpretability. To address these issues, the authors propose QPP-GenRE, a framework that decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list. Any IR evaluation measure can then be predicted using the generated relevance judgments as pseudo-labels, which also enables interpretation and error tracking. The framework uses open-source large language models (LLMs) to generate relevance judgments, ensuring scientific reproducibility.

Two main challenges are addressed: (i) the high computational cost of judging an entire corpus when predicting a measure that considers recall, and (ii) the limited performance of prompting LLMs in a zero-/few-shot manner. For the first, an approximation strategy predicts measures considering recall by judging only a few items at the top of the ranked list; for the second, LLMs are fine-tuned on human-labeled relevance judgments to improve their ability to generate relevance judgments.

Experiments on the TREC 2019–2022 deep learning tracks show that QPP-GenRE achieves state-of-the-art QPP quality for both lexical and neural rankers, outperforming existing baselines in correlation coefficients for a precision-oriented measure (RR@10) and a measure considering recall (nDCG@10). QPP-GenRE also predicts IR evaluation measures effectively at shallow judging depths, and fine-tuning LLMs improves both relevance judgment generation and QPP quality. The framework is compatible with various relevance prediction methods and integrates with a state-of-the-art pointwise LLM-based re-ranker, RankLLaMA, to achieve high QPP quality.
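The core idea, computing an IR evaluation measure directly from per-item judgments generated by an LLM, can be sketched as follows. This is a minimal illustration, not the authors' implementation; it assumes the LLM has already judged the top-ranked items and returned binary pseudo-labels, and it approximates the ideal DCG from only the judged prefix of the ranked list, in the spirit of the paper's shallow-judging approximation strategy.

```python
import math

def rr_at_k(judgments, k=10):
    """Reciprocal Rank@k from binary pseudo-labels (1 = relevant)."""
    for rank, rel in enumerate(judgments[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def approx_ndcg_at_k(judgments, k=10):
    """nDCG@k approximated from judgments of the top-ranked items only.

    The exact ideal DCG requires relevance judgments for the whole corpus;
    here it is approximated using only the judged items, so the result is
    an estimate rather than the true nDCG@k.
    """
    dcg = sum(rel / math.log2(r + 1)
              for r, rel in enumerate(judgments[:k], start=1))
    ideal = sorted(judgments, reverse=True)
    idcg = sum(rel / math.log2(r + 1)
               for r, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the LLM judged the top-5 items; relevant items at ranks 2 and 4.
judgments = [0, 1, 0, 1, 0]
print(rr_at_k(judgments))         # -> 0.5
print(approx_ndcg_at_k(judgments))
```

Because the predicted score is assembled from per-item judgments, an analyst can trace a poor prediction back to individual (mis)judged items, which is the interpretability benefit the paper emphasizes.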
QPP-GenRE is efficient and has lower latency than some supervised QPP baselines, and it shows significant improvements in QPP quality over previous methods. The framework is suitable for knowledge-intensive professional search scenarios and can be used to analyze search system performance in offline settings. The authors also provide a reproducible implementation, including data, scripts, and fine-tuned checkpoints for LLaMA-7B, Llama-3-8B, and Llama-3-8B-Instruct.
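To make the pointwise judging step concrete, the sketch below shows how a per-item relevance judgment might be obtained from an open-source LLM. The prompt template and the Yes/No parsing rule are hypothetical, invented here for illustration; the paper's actual prompt wording and label mapping may differ.

```python
def build_judgment_prompt(query: str, passage: str) -> str:
    """Build a pointwise relevance-judgment prompt.

    Hypothetical template; QPP-GenRE's actual prompt may be worded
    differently, and fine-tuned models are trained on such pairs.
    """
    return (
        "Judge whether the passage answers the query. "
        "Answer 'Yes' or 'No' only.\n"
        f"Query: {query}\n"
        f"Passage: {passage}\n"
        "Answer:"
    )

def parse_judgment(llm_output: str) -> int:
    """Map the model's free-text answer to a binary pseudo-label."""
    return 1 if llm_output.strip().lower().startswith("yes") else 0

# Example: one (query, passage) pair produces one independent subtask.
prompt = build_judgment_prompt(
    "what is query performance prediction",
    "QPP estimates the retrieval quality of a search system for a query.",
)
label = parse_judgment("Yes, the passage answers the query.")  # -> 1
```

Running this prompt once per item in the ranked list yields the pseudo-label list that the evaluation measures are computed from; because each judgment is an independent subtask, the calls can be batched or parallelized.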