16 Jul 2024 | Hongjin Su, Howard Yen, Mengzhou Xia, Weijia Shi, Niklas Muennighoff, Han-yu Wang, Haisu Liu, Quan Shi, Zachary S. Siegel, Michael Tang, Ruoxi Sun, Jinsung Yoon, Sercan Ö. Arık, Danqi Chen, Tao Yu
The paper introduces BRIGHT, a new benchmark for text retrieval that focuses on *intensive reasoning* to identify relevant documents. Unlike existing benchmarks that primarily consist of simple keyword or semantic-based retrieval tasks, BRIGHT is designed to simulate real-world scenarios where complex queries require deep understanding and reasoning. The benchmark includes 12 datasets from diverse domains, such as economics, psychology, robotics, and software engineering, sourced from natural user queries or curated human data.
The evaluation reveals that even state-of-the-art retrieval models perform poorly on BRIGHT, achieving an average nDCG@10 score of only 22.1. To improve performance, the authors explore strategies such as using large language models (LLMs) to generate Chain-of-Thought reasoning steps as queries, which improves average retrieval performance by up to 12.2 points. Additionally, BRIGHT is shown to be robust to data leakage during the pretraining of the benchmarked models: performance remains stable even when documents from the benchmark are included in the training data.
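To make the query-expansion idea concrete, here is a minimal sketch (not the authors' exact pipeline): a hypothetical `generate_reasoning` helper stands in for an LLM call, and a toy token-overlap scorer stands in for a real lexical or dense retriever. The reasoning trace is simply appended to the original query before scoring documents.

```python
# Sketch of reasoning-augmented retrieval: expand the query with
# LLM-generated chain-of-thought text, then rank documents with the
# expanded query. Both helper functions below are illustrative stand-ins,
# not part of the BRIGHT codebase.

def generate_reasoning(query: str) -> str:
    # Hypothetical LLM call: in practice, prompt a model with something like
    # "Think step by step about what knowledge is needed to answer: {query}"
    # and return its reasoning trace.
    return "relevant concepts: price elasticity, demand curve, consumer sensitivity"

def overlap_score(query_tokens: set[str], doc: str) -> float:
    # Toy scorer: token overlap, length-normalized. A real system would use
    # BM25 or a dense embedding model instead.
    doc_tokens = set(doc.lower().split())
    return len(query_tokens & doc_tokens) / (len(doc_tokens) ** 0.5 or 1.0)

def retrieve(query: str, corpus: list[str], k: int = 10, use_cot: bool = True) -> list[str]:
    expanded = (query + " " + generate_reasoning(query)) if use_cot else query
    q_tokens = set(expanded.lower().split())
    ranked = sorted(corpus, key=lambda d: overlap_score(q_tokens, d), reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    corpus = [
        "Price elasticity of demand measures how quantity responds to price changes.",
        "A robot arm uses inverse kinematics to reach a target pose.",
    ]
    print(retrieve("Why did sales drop after the price increase?", corpus, k=1))
```

The intended effect is that the reasoning text surfaces vocabulary (e.g., "elasticity") that the original query never mentions but that relevant documents do, which is the kind of reasoning gap BRIGHT is built to expose.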
The paper also discusses the challenges of long-context retrieval, where retrieving information from lengthy documents remains difficult despite a reduced search space. Overall, BRIGHT aims to push the boundaries of retrieval systems in more realistic and challenging settings, and the authors hope it will inspire future research in this area.