∞BENCH: Extending Long Context Evaluation Beyond 100K Tokens


24 Feb 2024 | Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, Maosong Sun
∞BENCH is a new benchmark for evaluating the ability of large language models (LLMs) to process long contexts, with an average data length exceeding 100K tokens. It includes synthetic and realistic tasks across diverse domains, presented in both English and Chinese. The tasks require understanding long-range dependencies in the context; simply retrieving a limited number of passages is not sufficient to solve them. The benchmark is used to evaluate state-of-the-art proprietary and open-source LLMs tailored for long contexts, and the results indicate that existing long-context LLMs still require significant advancement to process 100K+ token contexts effectively. The paper also presents three analyses of how LLMs behave when processing long contexts. The code and data are released.

∞BENCH comprises 12 tasks spanning 5 domains: retrieval, code, math, novels, and dialogue. Two of these tasks are derived from existing literature; among the newly introduced tasks, half are generated automatically and the remainder are annotated by humans. In total, ∞BENCH contains 3,946 examples, with lengths exceeding 100K tokens (approximately 200K on average). The tasks fall into two broad categories: realistic-context tasks, which reflect real-world scenarios with practical uses for long-context LLMs, and synthetic-context tasks, which are created or collected to test specific capabilities of long-context LLMs (a sketch of how such a synthetic example might be constructed is given below).
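The following is a minimal sketch of how a synthetic long-context retrieval example might be built, in the spirit of the benchmark's pass-key-style retrieval tasks. The filler text, prompt wording, and target length here are illustrative assumptions, not ∞BENCH's actual generation code.

```python
import random
import string

# Illustrative filler used to pad the context to 100K+ tokens.
FILLER = ("The grass is green. The sky is blue. The sun is yellow. "
          "Here we go. There and back again. ")

def make_passkey_example(target_chars: int = 400_000, seed: int = 0) -> dict:
    """Build one example whose context is roughly `target_chars` characters
    (on the order of 100K tokens for typical tokenizers), with a pass key
    hidden at a random depth."""
    rng = random.Random(seed)
    passkey = "".join(rng.choices(string.digits, k=6))
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key."

    n_filler = target_chars // len(FILLER)
    insert_at = rng.randrange(n_filler)      # random position in the haystack
    chunks = [FILLER] * n_filler
    chunks.insert(insert_at, needle)

    return {
        "context": "".join(chunks),
        "question": "What is the pass key mentioned in the text above?",
        "answer": passkey,
    }

example = make_passkey_example()
print(len(example["context"]), example["answer"])
```

The key property of such a task is that the answer occupies a tiny span at an arbitrary depth inside a very long context, so it isolates long-range retrieval ability from other reasoning skills.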
Several state-of-the-art (SOTA) long-context LLMs are evaluated on the benchmark to gauge its difficulty and to assess their effectiveness. The results show that current SOTA LLMs are not yet equipped to handle all of the tasks in ∞BENCH, highlighting the ongoing challenge of enabling LLMs to process long contexts effectively. The paper also analyzes the behavior of LLMs on such long contexts, including an ablation on task length, the absence of the "lost in the middle" phenomenon, and a context-recalling prompting technique (see the sketch after the contributions below).

The contributions of this work include the construction and release of ∞BENCH, the first multi-domain bilingual benchmark for evaluating the ability to understand and reason over contexts exceeding 100K tokens. The evaluation of SOTA long-context LLMs reveals severe performance degradation as context length scales, and the experimental results and analyses point to promising directions for improving long-context LLMs.
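As a rough illustration of the context-recalling idea referenced above, the sketch below first asks the model to quote the passage of the long context relevant to the question, and only then to answer from that quote. The exact prompt wording is an assumption, and `call_llm` is a hypothetical placeholder for whichever long-context model is under evaluation.

```python
def call_llm(prompt: str) -> str:
    # Placeholder: plug in the chat/completion API of the model being evaluated.
    raise NotImplementedError

def answer_with_context_recall(context: str, question: str) -> str:
    """Two-step prompting: recall the relevant excerpt first, then answer."""
    recall_prompt = (
        f"{context}\n\n"
        f"Question: {question}\n"
        "First, copy the sentences from the text above that are most relevant "
        "to the question. Do not answer yet."
    )
    recalled = call_llm(recall_prompt)

    answer_prompt = (
        f"Relevant excerpt:\n{recalled}\n\n"
        f"Question: {question}\n"
        "Now answer the question using only the excerpt above."
    )
    return call_llm(answer_prompt)
```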