14 Jun 2024 | Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev
The BABILong benchmark is introduced to evaluate the ability of large language models (LLMs) to reason across facts distributed in extremely long documents. It includes 20 diverse reasoning tasks, such as fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks challenge models to process long contexts in which the required facts are scattered across long natural text. The benchmark uses books from the PG19 corpus as a source of long natural documents, allowing tasks of almost arbitrary length; BABILong provides splits up to 1 million tokens and evaluates models on samples of up to 11 million tokens.

Evaluations show that popular LLMs effectively use only 10-20% of the available context, with performance declining sharply as task complexity increases. Retrieval-Augmented Generation (RAG) methods achieve only modest accuracy, while recurrent memory transformers (RMT) demonstrate the highest performance, processing sequences of up to 11 million tokens. The tasks are also solvable by relatively small models, such as RMT with a GPT-2 backbone and Mamba.

Because the benchmark is extendable to any length, it supports the evaluation of new models with increasing capabilities. The results highlight the limitations of current models in effectively utilizing long contexts and the need for improved context-processing mechanisms. Compared with other long-context benchmarks, BABILong better detects differences in model behavior across varying context lengths, making it a more representative evaluation framework for long-context reasoning.
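To make the construction concrete, the sketch below shows one way such a sample could be assembled: bAbI-style supporting facts are inserted, in their original order, at random positions inside filler text drawn from a long natural document, and the question is posed only after the full context. This is a minimal illustration under stated assumptions, not the authors' actual generation code; `build_babilong_sample`, its parameters, and the placeholder filler text are hypothetical names for illustration, whereas real BABILong samples are built from PG19 books with task-specific fact sets.

```python
import random

def build_babilong_sample(facts, question, background_sentences, context_size):
    """Hypothetical sketch: hide ordered task facts inside long filler text.

    facts                -- bAbI-style fact sentences needed to answer `question`
    background_sentences -- sentences from a long natural document (e.g. a PG19 book)
    context_size         -- number of filler sentences, which controls context length
    """
    filler = list(background_sentences[:context_size])
    # Pick distinct insertion slots and keep the facts in their original order,
    # so multi-hop chains (e.g. fact chaining) remain answerable.
    slots = sorted(random.sample(range(len(filler) + 1), k=len(facts)))
    for offset, (slot, fact) in enumerate(zip(slots, facts)):
        filler.insert(slot + offset, fact)
    return " ".join(filler), question

# Stand-in for sentences from a long PG19 book (placeholder text).
book_sentences = [f"Background sentence {i} from a long novel." for i in range(10_000)]

# Toy usage: the answer ("office") depends on the last relevant fact,
# which the model must locate among thousands of distractor sentences.
facts = ["Mary moved to the bathroom.", "John went to the hallway.",
         "Mary travelled to the office."]
context, q = build_babilong_sample(facts, "Where is Mary?", book_sentences, 5000)
```

Under this scheme, increasing the amount of filler dilutes the same facts among ever more distractor text, which is how samples of effectively arbitrary length (up to millions of tokens) can be produced from the same underlying tasks.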