25 Jun 2024 | Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, Yunshui Li, Min Yang, Fei Huang, Yongbin Li
Loong is a novel benchmark for evaluating long-context understanding in multi-document scenarios. It introduces four tasks: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning. The benchmark comprises 1,600 test instances in English and Chinese with varying context lengths, and it is constructed from real-world documents drawn from financial reports, legal cases, and academic papers, ensuring a realistic multi-document evaluation. Crucially, Loong requires that no document be ignored: every document in an instance is relevant to the final answer, so models must process the full long context and perform complex reasoning rather than rely on a few salient passages. The results show that even the most advanced LLMs struggle with Loong's tasks, indicating significant room for improvement in long-context modeling, and the benchmark also exposes the limitations of retrieval-augmented generation (RAG) in this setting. By spanning different context lengths and task complexities, Loong offers a comprehensive view of the strengths and weaknesses of current models.
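To make the setup concrete, below is a minimal sketch of how a Loong-style multi-document instance and a per-task scoring loop might look. The `LoongInstance` fields, the prompt layout, and the exact-match scoring are illustrative assumptions, not the benchmark's released data schema or official judging protocol.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical representation of a Loong-style test instance; the field
# names and structure are assumptions for illustration, not the released format.
@dataclass
class LoongInstance:
    task: str              # e.g. "spotlight_locating", "comparison",
                           # "clustering", or "chain_of_reasoning"
    documents: List[str]   # every document is relevant to the final answer
    question: str
    reference_answer: str
    context_length: int    # approximate total length of all documents


def build_prompt(instance: LoongInstance) -> str:
    """Concatenate all documents and append the question.

    Loong requires that no document be ignored, so the full document set is
    placed in the context rather than a retrieved subset (as RAG would do).
    """
    numbered_docs = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(instance.documents)
    )
    return f"{numbered_docs}\n\nQuestion: {instance.question}\nAnswer:"


def evaluate(instances: List[LoongInstance],
             generate: Callable[[str], str]) -> Dict[str, float]:
    """Score a model (passed in as a `generate` callable) per task type.

    Exact-match scoring is a simplification used here; the benchmark's
    official judging of open-ended answers may differ.
    """
    per_task: Dict[str, Dict[str, int]] = {}
    for inst in instances:
        prediction = generate(build_prompt(inst))
        correct = prediction.strip() == inst.reference_answer.strip()
        stats = per_task.setdefault(inst.task, {"correct": 0, "total": 0})
        stats["correct"] += int(correct)
        stats["total"] += 1
    return {task: s["correct"] / s["total"] for task, s in per_task.items()}


if __name__ == "__main__":
    demo = LoongInstance(
        task="comparison",
        documents=["Company A reported revenue of $10M in 2023.",
                   "Company B reported revenue of $12M in 2023."],
        question="Which company reported higher revenue in 2023?",
        reference_answer="Company B",
        context_length=40,
    )
    # A trivial stand-in for a real LLM call, used only to exercise the loop.
    print(evaluate([demo], generate=lambda prompt: "Company B"))
```

In a real run, `generate` would wrap a long-context model's API, and accuracy would be reported per task and per context-length bucket to reproduce the kind of breakdown the benchmark targets.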