This paper presents the first large-scale human evaluation of faithfulness and content selection in book-length summaries generated by large language models (LLMs). The authors address the challenge of evaluating input-dependent aspects such as faithfulness, which is difficult given the length and complexity of book-length documents. To mitigate data contamination, they focus on summaries of books published in 2023 or 2024 and hire annotators who have read each book prior to the annotation task. The resulting dataset, FABLES, contains 3,158 claim-level faithfulness annotations across 26 narrative texts, with CLAUDE-3-Opus emerging as the most faithful summarizer. The study also examines content selection errors, such as omissions and overemphasis on events near the end of the book. The authors find that unfaithful claims often concern events and character states that require indirect reasoning over the narrative. They also implement LLM-based automatic raters of faithfulness but find that none correlate strongly with human annotations. The paper concludes by calling for faithfulness evaluation benchmarks that cover broader error types and task settings.
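
To make the idea of claim-level automatic rating concrete, below is a minimal sketch of how an LLM-based faithfulness rater could be prompted to judge a single summary claim against retrieved book evidence. This is not the paper's implementation: the function name, prompt wording, and model string are illustrative assumptions, and it presumes the `openai` Python package with an API key available in the environment.

```python
# Illustrative claim-level faithfulness rater (hypothetical; not FABLES' actual pipeline).
# Assumes the `openai` package (>=1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You are given an excerpt from a book and a claim extracted from a summary "
    "of that book.\n\nExcerpt:\n{evidence}\n\nClaim:\n{claim}\n\n"
    "Answer with exactly one word, 'faithful' or 'unfaithful', indicating whether "
    "the claim is supported by the excerpt."
)

def rate_claim(claim: str, evidence: str, model: str = "gpt-4o-mini") -> str:
    """Ask the LLM whether a single summary claim is supported by the given evidence."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(evidence=evidence, claim=claim)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

if __name__ == "__main__":
    # Toy example with invented text; a real rater would retrieve relevant book passages.
    verdict = rate_claim(
        claim="The protagonist leaves her hometown at the end of the novel.",
        evidence="In the final chapter, Mara boards the night train out of Creel, vowing never to return.",
    )
    print(verdict)  # expected: "faithful"
```

A rater along these lines judges one claim at a time, which mirrors the claim-level granularity of the human annotations it would be compared against; the paper's finding is that such automatic raters still fall short of strong correlation with human judgments.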