Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

10 Apr 2024 | Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen
Ada-LEval is a novel benchmark designed to evaluate the long-context understanding capabilities of large language models (LLMs). It addresses the limitations of existing benchmarks, such as L-Eval and LongBench, which primarily focus on QA and summarization tasks and do not cover ultra-long contexts (100k+ tokens). Ada-LEval includes two challenging tasks, TSort and BestAnswer, which require models to manipulate and understand text segments of varying lengths, up to 128k tokens. The benchmark supports precise manipulation of test case lengths and evaluates both proprietary and open-source models. Experiments on Ada-LEval reveal significant performance gaps between models, particularly in ultra-long contexts, highlighting the need for more advanced techniques to handle long texts effectively. The study also identifies limitations in current LLMs, such as poor instruction following and input order bias, and explores the effectiveness of scalable position embedding techniques. The results underscore the importance of comprehensive evaluation methods for long-context understanding in LLMs.
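To make the length-adaptable design concrete, here is a minimal sketch of how a TSort-style test case could be built: consecutive segments are drawn from a long source text, shuffled, and the segment length is chosen so the full prompt lands near a target token budget. The `tokenizer` interface (`encode`/`decode`) and all helper names are illustrative assumptions, not the authors' code.

```python
import random

def make_tsort_case(book_text, tokenizer, n_segments=4, target_tokens=8000):
    """Build one length-adaptable TSort-style test case (hypothetical helper).

    Draws n consecutive, equal-length segments from a long text, shuffles
    them, and records the ordering needed to restore the original sequence.
    Segment length is derived from target_tokens, so the same task definition
    yields prompts of any desired length.
    """
    tokens = tokenizer.encode(book_text)
    seg_len = target_tokens // n_segments              # tokens per segment
    assert len(tokens) >= seg_len * n_segments, "source text too short"

    # Pick a random window of n consecutive segments from the text.
    start = random.randrange(len(tokens) - seg_len * n_segments + 1)
    segments = [
        tokenizer.decode(tokens[start + i * seg_len : start + (i + 1) * seg_len])
        for i in range(n_segments)
    ]

    order = list(range(n_segments))
    random.shuffle(order)                              # shuffled presentation order
    shuffled = [segments[i] for i in order]

    # Gold answer: for each original position, the 1-based index of the
    # shuffled segment that belongs there.
    answer = [order.index(i) + 1 for i in range(n_segments)]
    return shuffled, answer
```

Because segment length scales with the token budget, the same generator can produce test cases anywhere from a few thousand tokens up to the 128k setting the benchmark reports.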
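On the scalable position embedding techniques the study explores, one representative approach is linear position interpolation for rotary embeddings (RoPE), sketched below. This assumes a RoPE-based model and is an illustrative example of the general technique, not the paper's exact recipe; the scaling factor and function names are assumptions.

```python
import torch

def rope_frequencies(dim, max_pos, base=10000.0, scale=1.0):
    """Rotary embedding angles with linear position interpolation (sketch).

    scale = trained_context / target_context; values below 1 compress
    position indices so a longer sequence fits the trained position range.
    """
    # Standard RoPE inverse frequencies for each pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(max_pos).float() * scale  # interpolated positions
    angles = torch.outer(positions, inv_freq)          # (max_pos, dim // 2)
    return torch.cos(angles), torch.sin(angles)
```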