16 Jul 2024 | Hongjin Su, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han-yu Wang, Haisu Liu, Quan Shi, Zachary S. Siegel, Michael Tang, Ruoxi Sun, Jinsung Yoon, Sercan Ö. Arık, Danqi Chen, Tao Yu
BRIGHT is a new benchmark for reasoning-intensive retrieval, where identifying the relevant documents requires intensive reasoning rather than straightforward query-document matching. It is built from 1,398 real-world queries across diverse domains, including StackExchange posts, coding problems, and theorem-based questions, sourced from naturally occurring or carefully curated human data. BRIGHT is challenging for state-of-the-art retrieval models: the top model on the MTEB leaderboard achieves only 18.0 nDCG@10 on BRIGHT, compared to 59.0 on the MTEB benchmark. Augmenting queries with Chain-of-Thought reasoning generated by large language models (LLMs) improves performance by up to 12.2 points, and BRIGHT is robust against data leakage during pretraining of the benchmarked models. Results across 13 evaluated retrieval models show that current retrieval systems struggle with reasoning-intensive tasks, and the benchmark is expected to inspire future research on retrieval in more realistic and challenging settings. The code and data are available at https://brightbenchmark.github.io.
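To make the Chain-of-Thought query-augmentation setup and the nDCG@10 metric concrete, here is a minimal sketch, not the official BRIGHT evaluation code, that ranks a toy corpus with a dense retriever under two conditions: the raw query and the query expanded with LLM-generated reasoning. The retriever name, the documents, and the `cot_reasoning` string are illustrative assumptions.

```python
# Minimal sketch (not BRIGHT's official evaluation code) contrasting retrieval
# with a raw query vs. the same query augmented with LLM-generated
# Chain-of-Thought reasoning, scored with binary-relevance nDCG@10.
import math
from sentence_transformers import SentenceTransformer, util

def ndcg_at_10(ranked_doc_ids, relevant_ids):
    """Binary-relevance nDCG@10 for a single ranked list."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_doc_ids[:10])
              if doc_id in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(len(relevant_ids), 10)))
    return dcg / ideal if ideal > 0 else 0.0

# Toy corpus and query; a real run would iterate over a BRIGHT split.
documents = {
    "d1": "A guide to choosing hiking boots for rocky terrain.",
    "d2": "Yeast fermentation slows when the dough is over-hydrated.",
}
relevant_ids = {"d2"}
query = "Why does my sourdough starter collapse after feeding?"
# In the paper's setup this reasoning text comes from an LLM; hard-coded here.
cot_reasoning = "The answer likely involves yeast activity and hydration levels."

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder dense retriever
doc_ids = list(documents)
doc_emb = model.encode([documents[d] for d in doc_ids])

for name, q in [("raw query", query),
                ("query + CoT", f"{query} {cot_reasoning}")]:
    scores = util.cos_sim(model.encode([q]), doc_emb)[0]
    ranked = [doc_ids[i] for i in scores.argsort(descending=True)]
    print(f"{name}: nDCG@10 = {ndcg_at_10(ranked, relevant_ids):.3f}")
```

The same two-condition comparison, run over each BRIGHT task and averaged, is the kind of measurement behind the reported gap between raw queries and CoT-augmented queries.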