Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

25 Jun 2024 | Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, Yunshui Li, Min Yang, Fei Huang, Yongbin Li
The paper introduces Loong, a novel benchmark designed to evaluate the long-context understanding capabilities of Large Language Models (LLMs) in realistic multi-document scenarios. Unlike existing benchmarks that pad test cases with irrelevant noise texts to artificially extend their length, Loong mirrors real-world usage: every document in a test case is relevant to the final answer, so the supporting evidence is scattered across multiple documents. Loong covers four task types (Spotlight Locating, Comparison, Clustering, and Chain of Reasoning) with context lengths ranging from 10K to 250K tokens. Extensive experiments on Loong show that even the most advanced LLMs still leave substantial room for improvement in long-context modeling. The paper also evaluates retrieval-augmented generation (RAG) on Loong and finds that it performs poorly, underscoring Loong's reliability as a test of genuine long-context capability. Additionally, the paper discusses the scaling behavior of context window sizes and the limitations of RAG in enhancing long-context modeling.
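
To make the setup concrete, below is a minimal sketch (not the authors' released code) of how a Loong-style multi-document QA instance could be assembled and bucketed by total context length. The data classes, the rough token estimate, and the exact bucket cut points are illustrative assumptions; only the four task names and the overall 10K-250K token range come from the paper.

```python
# Hypothetical sketch of a Loong-style multi-doc QA instance.
# Key property: all documents are relevant (no irrelevant noise padding),
# so evidence is genuinely scattered across the concatenated context.

from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    doc_id: str
    text: str  # full document text; every document contributes to the answer


@dataclass
class LoongStyleInstance:
    task: str  # "spotlight_locating" | "comparison" | "clustering" | "chain_of_reasoning"
    question: str
    documents: List[Document]
    answer: str


def approx_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token), used only for bucketing."""
    return len(text) // 4


def length_bucket(instance: LoongStyleInstance) -> str:
    """Assign an instance to a context-length bucket.

    The paper spans 10K-250K tokens overall; the cut points below are
    assumptions made for illustration, not the paper's exact sets.
    """
    total = sum(approx_tokens(d.text) for d in instance.documents)
    if total < 50_000:
        return "Set1 (10K-50K)"
    if total < 100_000:
        return "Set2 (50K-100K)"
    if total < 200_000:
        return "Set3 (100K-200K)"
    return "Set4 (200K-250K)"


def build_prompt(instance: LoongStyleInstance) -> str:
    """Concatenate *all* documents before the question, so the model must
    locate and combine evidence spread across every document."""
    parts = [
        f"[Document {i + 1}: {d.doc_id}]\n{d.text}"
        for i, d in enumerate(instance.documents)
    ]
    parts.append(f"Question ({instance.task}): {instance.question}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    docs = [
        Document("report_2022.txt", "Company A reported revenue of 1.2B in 2022. ..."),
        Document("report_2023.txt", "Company A reported revenue of 1.5B in 2023. ..."),
    ]
    inst = LoongStyleInstance(
        task="comparison",
        question="In which year did Company A report higher revenue, 2022 or 2023?",
        documents=docs,
        answer="2023",
    )
    print(length_bucket(inst))
    print(build_prompt(inst)[:200])
```

The contrast with noise-padded benchmarks is the design point: because no document can be safely discarded, a retriever that selects only a few passages (as in a typical RAG pipeline) risks dropping required evidence, which is consistent with the paper's finding that RAG performs poorly on Loong.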