BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

14 Jun 2024 | Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev
The paper introduces the BABILong benchmark, designed to evaluate the long-context reasoning capabilities of large language models (LLMs). The benchmark comprises 20 diverse reasoning tasks, including fact chaining, induction, deduction, counting, and handling lists and sets, with task facts embedded in long natural-text documents drawn from the PG19 corpus. The evaluation reveals that popular LLMs effectively use only 10-20% of the available context, and their performance declines sharply as reasoning complexity increases. Retrieval-Augmented Generation (RAG) methods achieve only modest accuracy even on single-fact questions, whereas recurrent memory transformers (RMT) and fine-tuned small models such as Mamba perform better, with RMT able to process contexts of up to 11 million tokens. The BABILong benchmark is scalable and can be extended to support even longer contexts, making it a valuable tool for evaluating new models with advanced capabilities.
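Concretely, each BABILong sample is built by scattering the fact sentences of a bAbI-style task through long stretches of irrelevant background text, so the model must locate and reason over the facts while ignoring the distractors. The Python sketch below illustrates this construction at a high level; the function name, parameters, and toy filler sentences are illustrative assumptions, not the paper's actual code (which uses PG19 books as filler and measures length in tokens).

```python
import random

def make_babilong_sample(facts, question, background_sentences, context_len_words):
    """Scatter task-relevant facts through irrelevant background text
    (a sketch of the reasoning-in-a-haystack construction)."""
    # Trim the background to the target context length. A word-level
    # budget is used here for simplicity; the benchmark counts tokens.
    filler, words = [], 0
    for sent in background_sentences:
        filler.append(sent)
        words += len(sent.split())
        if words >= context_len_words:
            break

    # Choose random insertion points while preserving fact order:
    # bAbI tasks are order-sensitive (later facts can override earlier ones).
    positions = sorted(random.sample(range(len(filler) + 1), len(facts)))
    for offset, (pos, fact) in enumerate(zip(positions, facts)):
        filler.insert(pos + offset, fact)

    return " ".join(filler) + f"\nQuestion: {question}"

# Toy usage with made-up sentences (illustrative only).
facts = ["Mary moved to the bathroom.", "John went to the hallway."]
background = ["It was a bright cold day in April."] * 200
print(make_babilong_sample(facts, "Where is Mary?", background, 500)[:300])
```

Because the filler text and target context length are free parameters, this construction scales to arbitrarily long inputs, which is what lets the benchmark grow with future model context windows.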