Synthetic Test Collections for Retrieval Evaluation

July 14-18, 2024 | Hossein A. Rahmani, Nick Craswell, Emine Yilmaz, Bhaskar Mitra, Daniel Campos
This paper explores the use of Large Language Models (LLMs) to generate synthetic test collections for information retrieval (IR) evaluation. Traditional test collections are built from real user queries and manual relevance judgments, which are costly and time-consuming to obtain. The authors investigate whether LLMs can instead generate both synthetic queries and synthetic relevance judgments, enabling fully synthetic test collections that can be used for IR evaluation without real user data or manual annotation.

The study generates synthetic queries by sampling passages from the MS MARCO v2 corpus and prompting LLMs such as T5 and GPT-4 to produce a query for each sampled passage. Human experts then review the generated queries to ensure quality. The vetted synthetic queries are combined with synthetic relevance judgments generated by GPT-4 to form a fully synthetic test collection.

The results show that synthetic test collections produce evaluation outcomes similar to those obtained with real test collections. The authors also analyze potential biases: collections constructed with LLMs may favor retrieval systems built on similar models, such as T5 or GPT-4. Even so, their experiments indicate that synthetic test collections generally rank systems comparably to real test collections, suggesting they can be a viable alternative for IR evaluation.

The study concludes that LLMs have the potential to generate synthetic test collections suitable for IR evaluation, but further research is needed to fully understand the potential biases and to develop strategies for mitigating them.
The authors suggest that future work should explore more advanced prompting methods and different LLMs for comparison with the test collection they have created.
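To make the described pipeline concrete, below is a minimal Python sketch of how such a fully synthetic collection might be assembled and sanity-checked against a real one. This is not the authors' implementation: the prompt wording, the `call_llm` placeholder, the helper function names, and the use of Kendall's tau as the agreement measure are assumptions for illustration only.

```python
"""Minimal sketch of a fully synthetic test-collection pipeline.

Assumptions (not from the paper): the prompt wording, the `call_llm`
placeholder, and the structure of the corpus/score dictionaries are
illustrative, not the authors' actual setup.
"""

import random
from scipy.stats import kendalltau


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g. T5 or GPT-4); wire to a real client."""
    raise NotImplementedError("Replace with your LLM of choice.")


def generate_synthetic_queries(corpus: dict[str, str], n_queries: int) -> dict[str, str]:
    """Sample passages and ask the LLM to write a query each passage would answer."""
    sampled = random.sample(list(corpus.items()), n_queries)
    queries = {}
    for i, (pid, passage) in enumerate(sampled):
        prompt = (
            "Write a search query that the following passage answers.\n\n"
            f"Passage: {passage}\n\nQuery:"
        )
        queries[f"synthetic_q{i}"] = call_llm(prompt).strip()
    return queries


def judge_relevance(query: str, passage: str) -> int:
    """Ask the LLM for a graded relevance label (0-3), mimicking TREC-style qrels."""
    prompt = (
        "On a scale of 0 (irrelevant) to 3 (perfectly relevant), how relevant "
        f"is this passage to the query?\n\nQuery: {query}\nPassage: {passage}\n\n"
        "Answer with a single digit:"
    )
    return int(call_llm(prompt).strip()[0])


def rank_correlation(real_scores: dict[str, float], synthetic_scores: dict[str, float]) -> float:
    """Kendall's tau between system rankings under real vs. synthetic judgments."""
    systems = sorted(real_scores)
    tau, _ = kendalltau(
        [real_scores[s] for s in systems],
        [synthetic_scores[s] for s in systems],
    )
    return tau
```

In this sketch, a tau close to 1.0 would mean the synthetic judgments order retrieval systems much as the real judgments do, which is the kind of agreement between synthetic and real test collections that the paper examines.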