Synthetic Test Collections for Retrieval Evaluation


SIGIR '24, July 14–18, 2024, Washington, DC, USA | Hossein A. Rahmani, Nick Craswell, Emine Yilmaz, Bhaskar Mitra, Daniel Campos
This paper explores the construction of fully synthetic test collections for information retrieval (IR) systems using Large Language Models (LLMs). The authors investigate whether it is possible to generate both synthetic queries and relevance judgments using LLMs, aiming to reduce the challenges associated with obtaining real user queries and manual relevance judgments. The study focuses on the passage retrieval task and uses the MS MARCO v2 passage corpus as a starting point for generating synthetic queries. Two methods, T5 and GPT-4, are employed to generate queries, and expert assessors refine the selection of queries. The paper also examines the reliability of synthetic queries by comparing their performance with real queries in the TREC Deep Learning Track 2023. Additionally, the authors explore the generation of synthetic relevance judgments using GPT-4 and evaluate the agreement between these judgments and human annotations. The results show that synthetic test collections can produce evaluation results similar to those obtained from real test collections, with minimal bias towards systems based on the same LLMs used in the construction process. The study concludes that while synthetic test collections are promising, further research is needed to fully understand and mitigate potential biases.
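The two measurements at the heart of this kind of study, agreement between LLM-generated relevance judgments and human labels, and similarity of the system rankings produced by synthetic versus real judgments, can be sketched with standard tools. The snippet below is a minimal illustration only: the labels and NDCG scores are hypothetical placeholders, not numbers from the paper, which reports its own agreement and correlation figures.

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import kendalltau

# Hypothetical graded relevance labels (0-3) for the same query-passage
# pairs, one set from human assessors and one from an LLM judge.
human_labels = [3, 2, 0, 1, 2, 3, 0, 1]
llm_labels   = [3, 2, 1, 1, 2, 2, 0, 1]

# Weighted kappa quantifies label agreement beyond chance; higher is better.
kappa = cohen_kappa_score(human_labels, llm_labels, weights="quadratic")
print(f"LLM vs. human label agreement (weighted kappa): {kappa:.3f}")

# Hypothetical NDCG@10 scores for five retrieval runs, evaluated once with
# human judgments and once with synthetic (LLM-generated) judgments.
ndcg_human     = [0.71, 0.68, 0.64, 0.59, 0.52]
ndcg_synthetic = [0.69, 0.70, 0.62, 0.57, 0.50]

# Kendall's tau checks whether the synthetic collection orders the systems
# the same way as the real collection; values near 1.0 suggest the synthetic
# test collection is a reliable substitute for system comparison.
tau, _ = kendalltau(ndcg_human, ndcg_synthetic)
print(f"System-ranking correlation (Kendall's tau): {tau:.3f}")
```

A ranking correlation such as Kendall's tau is the usual way to ask whether a cheaper test collection can stand in for an expensive one, since what matters for evaluation is which systems come out ahead, not the absolute metric values.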