Lost in the Middle: How Language Models Use Long Contexts

20 Nov 2023 | Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang
Language models can accept long input contexts, but how effectively they use the information in those contexts is another matter. This study evaluates language models on two tasks that require finding relevant information within the input: multi-document question answering and key-value retrieval. On both tasks, performance degrades significantly when the relevant information sits in the middle of the input context, even for models explicitly designed for long contexts. Performance is highest when the relevant information appears at the beginning or end of the input, producing a U-shaped curve that reflects primacy and recency biases. In other words, current models do not robustly access information throughout their input contexts; the evaluation protocol behind this finding is sketched below.

The study also finds that extended-context models are not necessarily better at using their input: when the input fits within both context windows, an extended-context model and its non-extended counterpart perform almost identically. A longer context window does not, by itself, improve the use of context.
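The positional finding comes from a controlled sweep: the document containing the answer is moved through every position among the distractor documents, and accuracy is recorded per position. Below is a minimal sketch of that protocol, assuming a hypothetical generate(prompt) callable for the model under test and examples carrying question, gold_doc, distractors, and answers fields; the prompt wording and the substring-match scoring are simplifications, not the paper's exact template or metric.

```python
# A minimal sketch of the positional-sweep evaluation. `generate(prompt) -> str`
# is a hypothetical wrapper around the model under test; each example is
# assumed to carry `question`, `gold_doc`, `distractors`, and `answers` fields.

def build_prompt(question, gold_doc, distractors, gold_position):
    """Place the answer-bearing document at a fixed position among distractors."""
    docs = list(distractors)
    docs.insert(gold_position, gold_doc)
    numbered = "\n\n".join(f"Document [{i + 1}]: {d}" for i, d in enumerate(docs))
    return (
        "Write a high-quality answer for the given question using only the "
        f"provided search results.\n\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )

def accuracy_by_position(examples, num_docs, generate):
    """Answer accuracy for each gold-document position 0..num_docs-1."""
    scores = []
    for pos in range(num_docs):
        correct = 0
        for ex in examples:
            prompt = build_prompt(ex["question"], ex["gold_doc"],
                                  ex["distractors"][:num_docs - 1], pos)
            answer = generate(prompt)
            # Simplified scoring: any gold alias appearing as a substring counts.
            correct += any(a.lower() in answer.lower() for a in ex["answers"])
        scores.append(correct / len(examples))
    return scores  # high at the ends, low in the middle => "lost in the middle"
```

Plotting the returned scores against position reproduces the U-shaped curve described above.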
The study further examines the roles of model architecture, query-aware contextualization, and instruction fine-tuning. Encoder-decoder models are more robust to changes in the position of relevant information, but only when evaluated on sequences within their training-time sequence length. Query-aware contextualization, which places the query both before and after the data, improves performance on key-value retrieval but has minimal effect on multi-document QA (see the sketch below). Instruction fine-tuning does not remove the U-shaped performance curve.

Finally, the study weighs the trade-off between providing more context and the model's ability to process it. More context can help a downstream task, but it also increases the amount of information the model must reason over, which can reduce accuracy. In a case study on open-domain question answering, model performance saturates well before retriever recall does, indicating that models fail to make use of additional retrieved documents. Taken together, the findings show that language models struggle to robustly access and use information in long input contexts, that extended-context models are not necessarily better at doing so, and that the positional evaluation protocols introduced here offer useful diagnostics for future long-context models.
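The key-value retrieval task gives the model a JSON object of random identifier pairs and asks for the value of one queried key. The sketch below, an assumption-laden illustration rather than the paper's verbatim setup, builds such a prompt and shows query-aware contextualization, i.e. stating the query before the data as well as after it:

```python
import json
import random
import uuid

def make_kv_prompt(num_pairs, query_aware=False, seed=0):
    """Build a synthetic key-value retrieval prompt from random UUID pairs."""
    rng = random.Random(seed)
    pairs = {
        str(uuid.UUID(int=rng.getrandbits(128))):
            str(uuid.UUID(int=rng.getrandbits(128)))
        for _ in range(num_pairs)
    }
    query_key = rng.choice(list(pairs))
    data = json.dumps(pairs, indent=1)
    question = f'What is the value associated with the key "{query_key}"?'
    if query_aware:
        # Query-aware contextualization: repeat the question before the data,
        # so a left-to-right model reads the keys with the query already in view.
        prompt = f"{question}\n\nJSON data:\n{data}\n\n{question}"
    else:
        prompt = f"JSON data:\n{data}\n\n{question}"
    return prompt, pairs[query_key]  # the prompt and the expected answer

# Example: a 75-pair context with the query stated on both sides of the data.
prompt, expected = make_kv_prompt(num_pairs=75, query_aware=True)
```

Because decoder-only models process tokens left to right, a query placed only after the data cannot influence how the data is encoded; repeating it up front lets the model contextualize the key-value pairs with the query in view, which is why the technique helps this retrieval-style task.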