LongICLBench: Long-context LLMs Struggle with Long In-context Learning


12 Jun 2024 | Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen
The paper introduces LongICLBench, a benchmark for evaluating long in-context learning (ICL) in extreme-label classification tasks. The benchmark consists of six datasets with 28 to 174 classes and input lengths ranging from 2K to 50K tokens. The authors evaluate 15 long-context LLMs on this benchmark and find that while these models perform well on less challenging tasks with smaller label spaces and shorter demonstrations, they struggle with more complex tasks like Discovery, which has 174 labels. The study reveals that long context understanding and reasoning remain significant challenges for existing LLMs. Further analysis shows a bias towards labels presented later in the sequence and the need for improved reasoning over multiple pieces of information. The paper concludes by highlighting the limitations of current LLMs in handling long, context-rich sequences and suggests that LongICLBench could serve as a more realistic evaluation tool for future long-context LLMs.
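To make the evaluation setup concrete, below is a minimal sketch of how a long in-context prompt for extreme-label classification can be assembled from many labeled demonstrations and scored by exact label match. This is an illustrative approximation, not the authors' evaluation harness: the `build_prompt` and `evaluate` helpers, the `model_fn` callable, and the toy data are assumptions introduced here.

```python
# Sketch of long in-context learning evaluation for extreme-label classification,
# in the spirit of LongICLBench. The prompt format, helper names, and `model_fn`
# (any text-in/text-out LLM interface) are illustrative assumptions.
from typing import Callable, Sequence, Tuple

def build_prompt(demos: Sequence[Tuple[str, str]], query: str) -> str:
    """Concatenate many labeled demonstrations (potentially tens of thousands
    of tokens) followed by one unlabeled test instance."""
    lines = [f"sentence: {text}\nlabel: {label}" for text, label in demos]
    lines.append(f"sentence: {query}\nlabel:")
    return "\n\n".join(lines)

def evaluate(model_fn: Callable[[str], str],
             demos: Sequence[Tuple[str, str]],
             test_set: Sequence[Tuple[str, str]]) -> float:
    """Exact-match accuracy: the model must emit one of the label strings
    (up to 174 of them in the hardest task) shown in the demonstrations."""
    correct = 0
    for text, gold in test_set:
        prediction = model_fn(build_prompt(demos, text)).strip().lower()
        correct += int(prediction == gold.lower())
    return correct / max(len(test_set), 1)

if __name__ == "__main__":
    # Stub model and toy data, only to show the calling convention.
    demos = [("great work on the release", "praise"),
             ("why was this merged so early", "question")]
    tests = [("this design is a mistake", "criticism")]
    dummy_model = lambda prompt: "praise"
    print(f"accuracy: {evaluate(dummy_model, demos, tests):.2f}")
```

In the benchmark itself the demonstration block covers every label at least once, which is what pushes prompts to the 2K to 50K token range as the label space grows.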