Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks


10 Apr 2024 | Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen
Ada-LEval is a length-adaptable benchmark for evaluating the long-context understanding of large language models (LLMs). It includes two challenging tasks: TSort and BestAnswer. TSort requires sorting text segments into the correct order, while BestAnswer involves selecting the best answer from multiple options. These tasks allow for precise evaluation of LLMs' ability to understand and reason over long texts. Ada-LEval supports test cases of varying lengths, up to 128k tokens, and is used to evaluate 4 closed-source and 6 open-source models. The results show that existing LLMs struggle with ultra-long contexts, with performance declining as text length increases. The benchmark also highlights issues such as limited instruction following and input order bias, and it demonstrates that scalable position embeddings improve performance in long-context settings. Ada-LEval is the first benchmark to evaluate LLMs under ultra-long contexts, revealing the limitations of current models and providing insights for future developments. Its length-adaptable design enables accurate and comprehensive assessment of LLMs' long-context capabilities.
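As a rough illustration of how a length-adaptable TSort-style test case might be constructed and scored, the Python sketch below shuffles ordered text segments and credits a prediction only if the entire ordering is recovered. This is not the authors' released code; the function names and the exact-match scoring criterion are assumptions made for illustration.

```python
import random

def make_tsort_case(segments, seed=0):
    """Shuffle ordered text segments to create a TSort-style test case.

    Returns the shuffled segments and the ground-truth ordering, i.e. the
    index (within the shuffled list) of each original segment in turn.
    (Illustrative sketch; not the official Ada-LEval implementation.)
    """
    rng = random.Random(seed)
    order = list(range(len(segments)))
    rng.shuffle(order)                      # shuffled[j] = segments[order[j]]
    shuffled = [segments[i] for i in order]
    # Inverse permutation: for each original segment, where it sits after shuffling.
    answer = sorted(range(len(order)), key=lambda j: order[j])
    return shuffled, answer

def exact_match(predicted_order, answer):
    """Score as correct only if the full permutation matches (assumed criterion)."""
    return predicted_order == answer

# Example usage: longer test cases can be built simply by using more/longer segments.
segs = ["Alice opened the door.", "She stepped inside.", "The room was dark."]
shuffled, answer = make_tsort_case(segs, seed=42)
print(shuffled, answer, exact_match(answer, answer))
```

Because test-case length is controlled by the number and size of the segments (or, for BestAnswer, the number of candidate answers), the same task definition can be scaled from short contexts up to ultra-long ones such as 128k tokens.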