LongICLBench: Long-context LLMs Struggle with Long In-context Learning
This paper introduces LongICLBench, a benchmark for evaluating long in-context learning on extreme-label classification tasks. The benchmark comprises six datasets of varying difficulty, with input lengths from 2K to 50K tokens and label spaces ranging from 28 to 174 classes; models must comprehend the entire input to recognize the full label space and make correct predictions.

The study evaluates 15 long-context LLMs and finds that they perform reasonably well on the easier tasks, which have smaller label spaces and shorter demonstrations, but struggle on harder tasks such as Discovery, which has 174 labels. Analysis reveals a bias toward labels presented later in the sequence and a need for stronger reasoning over multiple pieces of information; models are also sensitive to how instances are positioned within the prompt. Some models, such as GPT4-turbo, consistently benefit from more demonstrations, while others, such as Gemini-1.5-Pro, hold up even on the most challenging tasks. Overall, long-context understanding and reasoning remain difficult for existing LLMs, and LongICLBench offers a more realistic evaluation for future long-context models.
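To make the evaluation setup concrete, the sketch below shows one way a long in-context learning prompt for extreme-label classification could be assembled and scored: labeled demonstrations are concatenated ahead of an unlabeled query, and accuracy is measured by exact match against the gold label. The prompt template, dataset fields, and the `model_fn` callable are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of long in-context learning for extreme-label classification.
# The prompt format and `model_fn` interface are assumptions for illustration.
from typing import Callable, Sequence, Tuple


def build_icl_prompt(demos: Sequence[Tuple[str, str]], query: str) -> str:
    """Concatenate labeled demonstrations followed by one unlabeled query."""
    blocks = [f"Text: {text}\nLabel: {label}\n" for text, label in demos]
    blocks.append(f"Text: {query}\nLabel:")
    return "\n".join(blocks)


def evaluate_accuracy(
    demos: Sequence[Tuple[str, str]],
    test_set: Sequence[Tuple[str, str]],
    model_fn: Callable[[str], str],
) -> float:
    """Score a model by exact match between its predicted label and the gold label."""
    correct = 0
    for query, gold in test_set:
        prompt = build_icl_prompt(demos, query)
        prediction = model_fn(prompt).strip()
        correct += int(prediction == gold)
    return correct / len(test_set)


if __name__ == "__main__":
    # Tiny stand-in example; a real run would use hundreds of demonstrations
    # spanning the full label space, yielding prompts of 2K-50K tokens.
    demos = [("the plot was gripping", "positive"), ("flat and tedious", "negative")]
    tests = [("a gripping story", "positive")]
    print(evaluate_accuracy(demos, tests, model_fn=lambda p: "positive"))
```

With a large label space, every label typically needs at least one demonstration in the prompt, which is what pushes input lengths into the tens of thousands of tokens.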