20 Nov 2023 | Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang
The paper "Lost in the Middle: How Language Models Use Long Contexts" by Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang examines how well language models identify relevant information within long input contexts. The authors analyze two tasks: multi-document question answering and key-value retrieval. They find that performance is highest when the relevant information appears at the beginning or end of the input context and degrades significantly when it is placed in the middle, indicating that current language models do not robustly use information in long contexts. The study also examines how model architecture, query-aware contextualization, and instruction fine-tuning affect this behavior. The results suggest that providing longer contexts involves a trade-off: it increases the amount of content the model must process, which can reduce accuracy. The paper concludes with a case study on open-domain question answering, where model performance saturates well before retriever recall does, further highlighting the challenge of using long contexts effectively.
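As a rough illustration of the synthetic key-value retrieval setup the paper describes, the sketch below builds a JSON object of random UUID pairs and places the queried key at a chosen position in the context. The function name `make_kv_prompt`, the prompt wording, and the commented-out model call are illustrative assumptions, not the authors' exact code; varying `target_index` while holding everything else fixed is what isolates the positional effect.

```python
import json
import random
import uuid

def make_kv_prompt(num_pairs: int, target_index: int, seed: int = 0):
    """Build a synthetic key-value retrieval prompt with the queried pair
    placed at position `target_index` in the JSON context (0 = first)."""
    rng = random.Random(seed)
    pairs = [
        (str(uuid.UUID(int=rng.getrandbits(128))),
         str(uuid.UUID(int=rng.getrandbits(128))))
        for _ in range(num_pairs)
    ]
    target_key, target_value = pairs[target_index]
    # dicts preserve insertion order, so the pair's position in the JSON
    # string matches its index in `pairs`.
    context = json.dumps(dict(pairs), indent=1)
    prompt = (
        "Extract the value corresponding to the specified key in the "
        "JSON object below.\n\n"
        f"{context}\n\n"
        f'Key: "{target_key}"\n'
        "Corresponding value:"
    )
    return prompt, target_value

# Sweep the target position to probe positional sensitivity, e.g. with a
# 75-pair context (one of the sizes used in the paper).
for idx in (0, 37, 74):  # beginning, middle, end
    prompt, gold = make_kv_prompt(num_pairs=75, target_index=idx)
    # answer = model(prompt)        # hypothetical model call
    # correct = gold in answer      # simple substring-match scoring
```

Under this setup, the paper's "lost in the middle" finding corresponds to accuracy being high at `idx = 0` and `idx = 74` but dropping when the pair sits near `idx = 37`.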