BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models

21 Oct 2021 | Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, Iryna Gurevych
The paper introduces BEIR (Benchmarking-IR), a robust and heterogeneous evaluation benchmark for information retrieval models. BEIR addresses the limitations of existing benchmarks by providing a broad and diverse set of tasks and datasets for evaluating the generalization capabilities of retrieval models. The benchmark comprises 18 publicly available datasets drawn from nine retrieval tasks across domains such as news, scientific publications, and social media. The evaluation covers a wide range of retrieval architectures, including lexical, sparse, dense, late-interaction, and re-ranking models. The results show that BM25 remains a strong baseline, while re-ranking and late-interaction models achieve the best zero-shot performance at the cost of computational efficiency. Dense and sparse retrieval models, though computationally efficient, often underperform the other approaches, highlighting the need to improve their generalization capabilities. The paper also discusses the impact of annotation selection bias and argues for unbiased datasets that allow fair comparisons across different retrieval systems. BEIR is publicly available and designed to facilitate further research on robust and generalizable retrieval systems.
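Because BEIR is released as an open-source Python package, a zero-shot evaluation can be reproduced in a few lines. The sketch below is a minimal example, assuming the `beir` package (`pip install beir`) and an MS MARCO-trained SentenceBERT checkpoint; the dataset (SciFact), download URL, and model name are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal zero-shot evaluation sketch with the beir package.
# Assumptions: dataset name, download URL, and model checkpoint are illustrative.
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download one BEIR dataset and load it in the shared corpus / queries / qrels format.
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Zero-shot setting: the dense retriever is trained on MS MARCO and applied
# to the target dataset without any in-domain fine-tuning.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # nDCG@10 is the primary metric reported in the paper
```

The same `EvaluateRetrieval` wrapper can be pointed at other retriever classes (e.g., lexical BM25 or re-ranking pipelines) to compare architectures on identical data splits.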