1 Jul 2024 | Philippe Laban, Alexander R. Fabbri, Caiming Xiong, Chien-Sheng Wu
This paper introduces SummHay, a benchmark task for evaluating long-context large language models (LLMs) and Retrieval-Augmented Generation (RAG) systems. A system is given a "Haystack" of documents synthesized so that specific insights repeat across documents, along with a query, and must generate a summary that identifies the insights relevant to the query and cites the source documents that contain them.
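To make the task setup concrete, the sketch below shows one plausible way to represent a Haystack and assemble the long-context prompt described above. The field names, prompt wording, and citation format are illustrative assumptions, not the authors' released schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of the SummHay data layout described above; the class and
# field names here are assumptions made for illustration.

@dataclass
class Insight:
    insight_id: str            # e.g. "insight_03"
    text: str                  # insight statement expected in a good summary
    source_doc_ids: list[str]  # Haystack documents that contain this insight

@dataclass
class Haystack:
    domain: str                # "conversation" or "news"
    query: str                 # the query the summary must address
    documents: dict[str, str]  # doc_id -> full document text
    insights: list[Insight]    # reference insights tied to the query

def build_task_prompt(haystack: Haystack) -> str:
    """Assemble a single long-context prompt: all documents plus the query,
    asking for a bulleted summary in which every bullet cites document ids."""
    doc_block = "\n\n".join(
        f"[{doc_id}]\n{text}" for doc_id, text in haystack.documents.items()
    )
    return (
        f"{doc_block}\n\n"
        f"Query: {haystack.query}\n"
        "Write a bulleted summary of the insights relevant to the query. "
        "End each bullet with the ids of the supporting documents, e.g. [doc_04][doc_17]."
    )
```

In a RAG setting, the same prompt would be built from only the documents returned by a retriever rather than the full Haystack.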
The paper presents a procedure for generating Haystacks in two domains, conversations and news, and evaluates 10 LLMs and 50 RAG systems on the task. Evaluation focuses on two aspects: Coverage, which measures how many of the expected reference insights appear in the summary, and Citation, which measures whether each insight is attributed to the correct source documents. The results show that current systems struggle: even systems given an oracle signal of document relevance lag human performance by more than 10 points on a joint score, and long-context LLMs such as GPT-4o and Claude 3 Opus score below 20% on SummHay when used without a retriever.
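The Coverage and Citation scores, and one way they can combine into a joint score, are illustrated in the simplified sketch below. In the paper, matching summary bullets to reference insights is performed by an LLM judge; here the per-insight coverage labels and cited document sets are assumed to be given, and the partial-credit values and aggregation are illustrative assumptions rather than the paper's exact protocol.

```python
def citation_f1(predicted: set[str], reference: set[str]) -> float:
    """F1 between the doc ids cited for an insight and the reference doc ids."""
    if not predicted or not reference:
        return 0.0
    precision = len(predicted & reference) / len(predicted)
    recall = len(predicted & reference) / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def summhay_scores(matches: list[dict]) -> dict[str, float]:
    """Aggregate per-insight results into Coverage, Citation, and a joint score.

    `matches` is assumed to hold one entry per reference insight:
      {"coverage": 0.0 | 0.5 | 1.0,   # no / partial / full coverage (assumed scale)
       "cited": set of doc ids cited in the matching summary bullet,
       "reference": set of doc ids that actually contain the insight}
    """
    coverage = sum(m["coverage"] for m in matches) / len(matches)
    covered = [m for m in matches if m["coverage"] > 0]
    citation = (
        sum(citation_f1(m["cited"], m["reference"]) for m in covered) / len(covered)
        if covered else 0.0
    )
    # The joint score credits an insight only when it is both covered and correctly cited.
    joint = sum(
        m["coverage"] * citation_f1(m["cited"], m["reference"]) for m in matches
    ) / len(matches)
    return {"coverage": coverage, "citation": citation, "joint": joint}
```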
The paper also shows that SummHay can be used to study enterprise RAG systems and positional bias in long-context models. It concludes that SummHay remains an open challenge for current systems and that future systems can potentially equal and surpass human performance on the task. The paper further discusses the limitations of existing summarization evaluation methods and proposes a synthetic data generation approach to address them. The authors open-source their dataset and evaluation methodology.