1 Jul 2024 | Philippe Laban, Alexander R. Fabbri, Caiming Xiong, Chien-Sheng Wu
This paper introduces SummHay, a benchmark task for evaluating long-context large language models (LLMs) and Retrieval-Augmented Generation (RAG) systems. A system is given a "Haystack" of documents synthesized so that specific insights repeat across documents, along with a query, and must generate a summary that identifies the insights relevant to the query and cites the source documents that contain them.
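To make the task setup concrete, the sketch below shows one plausible way to represent a Haystack and assemble the long-context prompt described above. The field names, prompt wording, and citation format are illustrative assumptions, not the authors' released schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of the SummHay data layout described above; the class and
# field names here are assumptions made for illustration.

@dataclass
class Insight:
    insight_id: str            # e.g. "insight_03"
    text: str                  # insight statement expected in a good summary
    source_doc_ids: list[str]  # Haystack documents that contain this insight

@dataclass
class Haystack:
    domain: str                # "conversation" or "news"
    query: str                 # the query the summary must address
    documents: dict[str, str]  # doc_id -> full document text
    insights: list[Insight]    # reference insights tied to the query

def build_task_prompt(haystack: Haystack) -> str:
    """Assemble a single long-context prompt: all documents plus the query,
    asking for a bulleted summary in which every bullet cites document ids."""
    doc_block = "\n\n".join(
        f"[{doc_id}]\n{text}" for doc_id, text in haystack.documents.items()
    )
    return (
        f"{doc_block}\n\n"
        f"Query: {haystack.query}\n"
        "Write a bulleted summary of the insights relevant to the query. "
        "End each bullet with the ids of the supporting documents, e.g. [doc_04][doc_17]."
    )
```

In a RAG setting, the same prompt would be built from only the documents returned by a retriever rather than the full Haystack.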
The paper presents a procedure for generating Haystacks in two domains, conversations and news, and evaluates 10 LLMs and 50 RAG systems on the task. Evaluation focuses on two aspects: Coverage, which measures how many of the expected reference insights appear in the summary, and Citation, which measures whether each insight is attributed to the correct source documents. The results show that current systems struggle: even systems given an oracle signal of document relevance lag human performance by more than 10 points on a joint score, and long-context LLMs such as GPT-4o and Claude 3 Opus score below 20% on SummHay when used without a retriever.
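The Coverage and Citation scores, and one way they can combine into a joint score, are illustrated in the simplified sketch below. In the paper, matching summary bullets to reference insights is performed by an LLM judge; here the per-insight coverage labels and cited document sets are assumed to be given, and the partial-credit values and aggregation are illustrative assumptions rather than the paper's exact protocol.

```python
def citation_f1(predicted: set[str], reference: set[str]) -> float:
    """F1 between the doc ids cited for an insight and the reference doc ids."""
    if not predicted or not reference:
        return 0.0
    precision = len(predicted & reference) / len(predicted)
    recall = len(predicted & reference) / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def summhay_scores(matches: list[dict]) -> dict[str, float]:
    """Aggregate per-insight results into Coverage, Citation, and a joint score.

    `matches` is assumed to hold one entry per reference insight:
      {"coverage": 0.0 | 0.5 | 1.0,   # no / partial / full coverage (assumed scale)
       "cited": set of doc ids cited in the matching summary bullet,
       "reference": set of doc ids that actually contain the insight}
    """
    coverage = sum(m["coverage"] for m in matches) / len(matches)
    covered = [m for m in matches if m["coverage"] > 0]
    citation = (
        sum(citation_f1(m["cited"], m["reference"]) for m in covered) / len(covered)
        if covered else 0.0
    )
    # The joint score credits an insight only when it is both covered and correctly cited.
    joint = sum(
        m["coverage"] * citation_f1(m["cited"], m["reference"]) for m in matches
    ) / len(matches)
    return {"coverage": coverage, "citation": citation, "joint": joint}
```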
The paper also shows that SummHay can be used to study enterprise RAG systems and positional bias in long-context models. It concludes that SummHay remains an open challenge for current systems and that future systems can potentially equal and surpass human performance on the task. The paper further discusses the limitations of existing summarization evaluation methods and proposes a synthetic data generation approach to address them. The authors open-source their dataset and evaluation methodology.