21 Oct 2021 | Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, Iryna Gurevych
BEIR is a heterogeneous benchmark for zero-shot evaluation of information retrieval models. It comprises 18 publicly available datasets spanning nine retrieval tasks and diverse text domains, with corpora of varying sizes and query lengths, and it evaluates 10 state-of-the-art retrieval systems covering lexical, sparse, dense, late-interaction, and re-ranking architectures. The benchmark is model-agnostic and designed to be extensible, welcoming methods of all kinds and allowing new tasks and datasets to be integrated, and it provides a standardized data format together with easy-to-use code examples for many different retrieval strategies (a usage sketch follows below). It is publicly available at https://github.com/UKPLab/beir.

The results show that BM25 is a robust baseline: dense and sparse retrieval models are more efficient but often underperform it on several datasets, highlighting the need to improve their zero-shot generalization. Cross-attentional re-ranking and late-interaction models achieve the best zero-shot performance across the evaluated tasks, though at high computational cost, underscoring the trade-off between effectiveness and efficiency. The study also shows that in-domain performance does not necessarily predict zero-shot generalization, and that annotation selection bias causes lexical approaches to outperform dense retrieval methods on certain datasets.

Overall, BM25 remains a strong baseline for zero-shot text retrieval, and BEIR gives researchers a unified, heterogeneous benchmark for evaluating and understanding existing retrieval systems.
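As an illustration of the standardized data format and evaluation interface, here is a minimal sketch of running a dense retriever on one BEIR dataset. It follows the quickstart in the BEIR repository; the dataset URL, module paths, and the Sentence-BERT model name reflect the library at the time of writing and may differ in later versions.

```python
import logging

from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

logging.basicConfig(level=logging.INFO)

# Download and unzip one of the BEIR datasets (SciFact is small and quick to run).
dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")

# Every BEIR dataset shares the same format: a corpus, queries, and relevance judgements.
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Dense retrieval with a Sentence-BERT bi-encoder and exact (brute-force) search.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="dot")

# Retrieve the top documents for every query, then score against the relevance judgements.
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)
```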
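Because the evaluation interface is the same for every architecture, swapping the retriever object is all that is needed to compare approaches on identical data. For example, the BM25 baseline referenced throughout the results can be evaluated as sketched below; this assumes a local Elasticsearch instance is running, which BEIR's BM25 wrapper relies on, and the index name and hostname are placeholder values.

```python
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.lexical import BM25Search as BM25

# Reuse a dataset downloaded as in the previous sketch.
corpus, queries, qrels = GenericDataLoader(data_folder="datasets/scifact").load(split="test")

# BM25 via Elasticsearch; initialize=True (re)builds the index from the corpus.
model = BM25(index_name="scifact", hostname="localhost", initialize=True)
retriever = EvaluateRetrieval(model)

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)
```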