One Thousand and One Pairs: A “novel” challenge for long-context language models


18 Jul 2024 | Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, Mohit Iyyer
The paper introduces NoCha, a dataset designed to evaluate the long-context reasoning capabilities of large language models (LLMs). NoCha contains 1,001 pairs of true and false claims about 67 recently published English fiction books, written by human readers. Unlike synthetic benchmarks such as "needle-in-a-haystack" (NIAH), NoCha requires models to reason over the entire book to verify claims, a substantially harder task. Experiments with 11 LLMs (5 open-weight and 6 closed-source) show that while human readers easily perform this task, no open-weight model scores above random chance; GPT-4o achieves the highest accuracy at 55.8%. The study reveals that models struggle more with pairs requiring global reasoning than with sentence-level retrieval, and that their explanations for their decisions are often inaccurate. Additionally, models perform worse on speculative fiction books with extensive world-building.
The methodology used in NoCha can be extended to evaluate future models and datasets.
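Because each true claim is paired with a minimally different false claim, scoring is naturally done at the pair level: a model gets credit only when it verifies both claims in a pair correctly, which guards against one-sided guessing. The sketch below illustrates this pair-level scoring idea; the field names (`true_verdict`, `false_verdict`) are illustrative assumptions, not taken from the paper's released code.

```python
# Minimal sketch of pair-level claim-verification scoring, assuming each
# record holds the model's verdict for a true claim and for its minimally
# different false counterpart. Field names are hypothetical.

def pair_accuracy(pairs):
    """A pair counts as correct only if the model labels the true claim
    'true' AND the paired false claim 'false'."""
    correct = sum(
        1 for p in pairs
        if p["true_verdict"] == "true" and p["false_verdict"] == "false"
    )
    return correct / len(pairs)

# Example: the model handles the first pair but is fooled by the
# second pair's false claim, so only 1 of 2 pairs counts.
pairs = [
    {"true_verdict": "true", "false_verdict": "false"},
    {"true_verdict": "true", "false_verdict": "true"},
]
print(pair_accuracy(pairs))  # → 0.5
```

Under this metric a model that always answers "true" scores 0%, even though it labels half the individual claims correctly, which is why pair accuracy is a stricter test than per-claim accuracy.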