6 Jun 2024 | Larisa Markeeva, Sean McLeish, Borja Ibarz, Wilfried Bounsi, Olga Kozlova, Alex Vitvitskyi, Charles Blundell, Tom Goldstein, Avi Schwarzschild, Petar Veličković
The CLRS-Text algorithmic reasoning language benchmark is a textual version of the CLRS benchmark, which generates execution traces for classical algorithms from the Introduction to Algorithms textbook. CLRS-Text supports procedural generation of trace data for thirty diverse algorithmic tasks across any input distribution, and provides a standard pipeline for creating new algorithmic tasks.

The benchmark is designed to evaluate the reasoning capabilities of language models (LMs), probing how well they adapt to unfamiliar problem instances. It addresses a key difficulty with static reasoning datasets: once a dataset is fixed, models can overfit to it, creating an illusion of progress. CLRS-Text is instead a procedural dataset generator that converts CLRS's graph-based traces into textual form suitable for ingestion by LMs. This makes it possible to evaluate LMs on out-of-distribution generalisation, which is crucial for robust reasoning, to generate bespoke trace data at arbitrary input distributions, and to simplify comparisons across papers; a sketch of what such a textual trace might look like follows below.

CLRS-Text is evaluated in a zero-shot setting, with models tested on randomly sampled instances for each algorithm. The results show that randomised positional embeddings improve generalisation, but that the extrapolation performance of language models remains limited. The benchmark also emphasises evaluating base-model capabilities without relying on tool use or code interpreters. Overall, CLRS-Text provides a valuable resource for evaluating the reasoning capabilities of language models and for developing more robust, generalist models.
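To make the idea of a procedurally generated textual trace concrete, here is a minimal Python sketch that runs insertion sort on a randomly sampled input and serialises the intermediate states as a prompt/target pair. The trace format and function names are illustrative assumptions for this sketch, not the actual CLRS-Text serialisation; the real generator ships with the google-deepmind/clrs repository.

```python
import random


def insertion_sort_trace(xs):
    """Run insertion sort and record the array after each outer-loop step."""
    arr = list(xs)
    trace = [list(arr)]
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        while j >= 0 and arr[j] > key:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key
        trace.append(list(arr))
    return trace


def format_example(xs, trace):
    """Serialise an input and its trace as text.

    The prompt/target layout here is an assumption for illustration only;
    CLRS-Text defines its own serialisation of inputs, hints and outputs.
    """
    fmt = lambda a: " ".join(str(v) for v in a)
    prompt = (
        "insertion_sort:\n"
        f"key: [{fmt(xs)}]\n"
        f"initial_trace: [{fmt(trace[0])}]\n"
        "trace | pred:\n"
    )
    target = "\n".join(f"[{fmt(step)}]" for step in trace[1:])
    return prompt, target


if __name__ == "__main__":
    random.seed(0)
    # Sample an input of length 6; length and value range are free parameters,
    # which is what lets a procedural generator probe out-of-distribution sizes.
    xs = [random.randint(0, 9) for _ in range(6)]
    prompt, target = format_example(xs, insertion_sort_trace(xs))
    print(prompt + target)
```

Because the input length and distribution are parameters of the generator, the same pipeline can emit training data at one range of sizes and evaluation data at larger, unseen sizes, which is the out-of-distribution setting the benchmark targets.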