NOCHA is a benchmark dataset of 1,001 minimal pairs of true and false narrative claims about 67 recently published English novels. It was built to evaluate the ability of long-context large language models (LLMs) to retrieve, synthesize, and reason over information spread across book-length inputs. Unlike existing benchmarks, most NOCHA pairs can only be verified through global reasoning over the entire book, which makes the task substantially harder for LLMs. The pairs were written by human annotators who self-reported recently published novels they had read and then generated true/false pairs, each isolating a single narrative phenomenon, at a total annotation cost of $3,330.
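To make the minimal-pair format concrete, here is a small sketch of how one such record might be represented; the `NarrativePair` class and its field names are illustrative assumptions, not the dataset's actual release schema.

```python
from dataclasses import dataclass

# Illustrative sketch only: the class and field names are assumptions,
# not NOCHA's actual release schema.
@dataclass
class NarrativePair:
    book_title: str   # the novel both claims are about
    true_claim: str   # claim consistent with the book's full narrative
    false_claim: str  # minimally edited version that contradicts the narrative

example = NarrativePair(
    book_title="<recently published novel>",
    true_claim="<claim that holds given the entire narrative>",
    false_claim="<the same claim with one narrative detail flipped>",
)
```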
Experiments on 5 open-source and 6 closed-source models show that, while human readers easily perform this task, it is extremely challenging for all of the LLMs evaluated. No open-source model performs above random chance, and GPT-4o achieves the highest accuracy at 55.8%. Further analysis reveals that models perform much better on pairs requiring only sentence-level retrieval than on pairs requiring global reasoning. Model-generated explanations are often inaccurate even for correctly labeled claims, and models perform substantially worse on speculative fiction that involves extensive world-building.
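Because the claims come in minimal pairs, a natural way to score models is to credit a pair only when both its true and false claims are labeled correctly; the sketch below illustrates that scoring scheme as an assumption, reusing the illustrative `NarrativePair` record from the previous sketch, with `predict` standing in for any hypothetical model call.

```python
from typing import Callable, Iterable

def pair_accuracy(
    pairs: Iterable[NarrativePair],
    predict: Callable[[str, str], bool],  # hypothetical: (book_title, claim) -> predicted label
) -> float:
    """Fraction of pairs in which the model labels BOTH claims correctly."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = 0
    for p in pairs:
        both_right = (
            predict(p.book_title, p.true_claim) is True
            and predict(p.book_title, p.false_claim) is False
        )
        correct += int(both_right)
    return correct / len(pairs)

# Under this pair-level scoring assumption, a random guesser labels each
# claim correctly with probability 0.5, so it scores a full pair only
# about 25% of the time.
```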
The methodology behind NOCHA allows the benchmark to evolve over time and makes it straightforward to evaluate future models. The dataset is designed to assess the long-context reasoning capabilities of LLMs in a realistic task setting. The results show that models considered "state-of-the-art" according to synthetic benchmarks such as needle-in-a-haystack (NIAH) perform very poorly on NOCHA. Synthetic datasets nevertheless remain useful complements to realistic ones, offering greater flexibility for evaluating different context lengths and for analyzing the lost-in-the-middle phenomenon, so the study encourages researchers to take a holistic approach and consider both synthetic and realistic tasks when evaluating long-context language models. The study also acknowledges its limitations, including the focus on English novels and on claim verification as the sole task, and discusses ethical considerations such as the use and publication of the annotations and the fair compensation of annotators. Finally, it credits the contributions of the research team and the support received from various funding sources.