∞BENCH: Extending Long Context Evaluation Beyond 100K Tokens


24 Feb 2024 | Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, Maosong Sun
∞BENCH is a new benchmark for evaluating the ability of large language models (LLMs) to process long contexts, with an average data length exceeding 100K tokens. It includes synthetic and realistic tasks across diverse domains, presented in both English and Chinese. The tasks require understanding long-range dependencies in the context; simply retrieving a limited number of passages is not sufficient to solve them. The benchmark is used to evaluate state-of-the-art proprietary and open-source LLMs tailored for long contexts, and the results indicate that existing long-context LLMs still require significant advancement to process 100K+ token contexts effectively. The paper also presents three analyses of how LLMs behave when processing long contexts. The code and data are released.

∞BENCH comprises 12 tasks spanning 5 domains: retrieval, code, math, novels, and dialogue. Two of these tasks are derived from existing literature; among the newly introduced tasks, half are generated automatically and the remainder are annotated by humans. In total, ∞BENCH contains 3,946 examples, with lengths exceeding 100K tokens (approximately 200K on average). The tasks fall into two broad categories: realistic-context tasks, which reflect real-world scenarios with practical uses for long-context LLMs, and synthetic-context tasks, which are created or collected to test specific capabilities of long-context LLMs (a sketch of how such a synthetic example might be constructed is given below).
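The following is a minimal sketch of how a synthetic long-context retrieval example might be built, in the spirit of the benchmark's pass-key-style retrieval tasks. The filler text, prompt wording, and target length here are illustrative assumptions, not ∞BENCH's actual generation code.

```python
import random
import string

# Illustrative filler used to pad the context to 100K+ tokens.
FILLER = ("The grass is green. The sky is blue. The sun is yellow. "
          "Here we go. There and back again. ")

def make_passkey_example(target_chars: int = 400_000, seed: int = 0) -> dict:
    """Build one example whose context is roughly `target_chars` characters
    (on the order of 100K tokens for typical tokenizers), with a pass key
    hidden at a random depth."""
    rng = random.Random(seed)
    passkey = "".join(rng.choices(string.digits, k=6))
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key."

    n_filler = target_chars // len(FILLER)
    insert_at = rng.randrange(n_filler)      # random position in the haystack
    chunks = [FILLER] * n_filler
    chunks.insert(insert_at, needle)

    return {
        "context": "".join(chunks),
        "question": "What is the pass key mentioned in the text above?",
        "answer": passkey,
    }

example = make_passkey_example()
print(len(example["context"]), example["answer"])
```

The key property of such a task is that the answer occupies a tiny span at an arbitrary depth inside a very long context, so it isolates long-range retrieval ability from other reasoning skills.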
Several state-of-the-art (SOTA) long-context LLMs are evaluated on the benchmark to gauge its difficulty and to assess their effectiveness. The results show that current SOTA LLMs are not yet equipped to handle all of the tasks in ∞BENCH, highlighting the ongoing challenge of enabling LLMs to process long contexts effectively. The paper also analyzes the behavior of LLMs on such long contexts, including an ablation on task length, the absence of the "lost in the middle" phenomenon, and a context-recalling prompting technique (see the sketch after the contributions below).

The contributions of this work include the construction and release of ∞BENCH, the first multi-domain bilingual benchmark for evaluating the ability to understand and reason over contexts exceeding 100K tokens. The evaluation of SOTA long-context LLMs reveals severe performance degradation as context length scales, and the experimental results and analyses point to promising directions for improving long-context LLMs.
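As a rough illustration of the context-recalling idea referenced above, the sketch below first asks the model to quote the passage of the long context relevant to the question, and only then to answer from that quote. The exact prompt wording is an assumption, and `call_llm` is a hypothetical placeholder for whichever long-context model is under evaluation.

```python
def call_llm(prompt: str) -> str:
    # Placeholder: plug in the chat/completion API of the model being evaluated.
    raise NotImplementedError

def answer_with_context_recall(context: str, question: str) -> str:
    """Two-step prompting: recall the relevant excerpt first, then answer."""
    recall_prompt = (
        f"{context}\n\n"
        f"Question: {question}\n"
        "First, copy the sentences from the text above that are most relevant "
        "to the question. Do not answer yet."
    )
    recalled = call_llm(recall_prompt)

    answer_prompt = (
        f"Relevant excerpt:\n{recalled}\n\n"
        f"Question: {question}\n"
        "Now answer the question using only the excerpt above."
    )
    return call_llm(answer_prompt)
```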