FABLES is a large-scale human evaluation of faithfulness and content selection in book-length summarization. The study focuses on summaries of books published in 2023 or 2024, and every annotator had read the book in full before annotating, in order to minimize bias. The dataset comprises 3,158 claim-level annotations across 26 books, collected at a cost of $5,200. The results show that CLAUDE-3-OPUS is the most faithful summarizer, followed by GPT-4-TURBO. Automatic faithfulness evaluation remains challenging, however: no LLM-based rater correlates strongly with the human annotations, especially when it comes to detecting unfaithful claims.
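To make the claim-level comparison concrete, the sketch below shows one simple way an automatic rater's binary faithful/unfaithful verdicts could be scored against human annotations. This is an illustrative assumption, not the FABLES evaluation code: the function, labels, and toy data are hypothetical, and the study itself reports correlation-style analyses rather than this exact metric.

```python
# Illustrative sketch (hypothetical, not the FABLES codebase): score an
# LLM rater's claim-level faithfulness verdicts against human labels.
# Convention assumed here: 1 = claim judged faithful, 0 = unfaithful.

def agreement_stats(human, model):
    """Return overall agreement and recall on the unfaithful class,
    the case where automatic raters struggle most per the study."""
    assert len(human) == len(model)
    agree = sum(h == m for h, m in zip(human, model)) / len(human)
    # Of the claims humans marked unfaithful, how many did the rater flag?
    unfaithful = [(h, m) for h, m in zip(human, model) if h == 0]
    flagged = sum(1 for _, m in unfaithful if m == 0)
    recall_unfaithful = flagged / len(unfaithful) if unfaithful else float("nan")
    return agree, recall_unfaithful

# Toy example: humans mark 3 of 10 claims unfaithful; the rater catches 1.
human_labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
model_labels = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
acc, rec = agreement_stats(human_labels, model_labels)
print(f"agreement={acc:.2f}, unfaithful-claim recall={rec:.2f}")
```

High overall agreement can coexist with poor recall on unfaithful claims, which is why a single accuracy-style number can overstate how well an automatic rater tracks human judgments.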
The study also analyzes content selection errors, such as omissions of key events, details, and themes, and identifies a systematic overemphasis on book endings. A taxonomy of omission errors is developed, revealing that all LLMs frequently omit key narrative elements. In addition, models such as CLAUDE-3-OPUS and GPT-4-TURBO tend to overemphasize content from the end of a book at the expense of its beginning. The study highlights the difficulty of evaluating faithfulness and content selection in book-length summarization and emphasizes the need for further research in this area. FABLES thus provides a benchmark for evaluating long-context understanding and summarization quality.