6 Jun 2024 | Larisa Markeeva, Sean McLeish, Borja Ibarz, Wilfried Bounsi, Olga Kozlova, Alex Vitvitskyi, Charles Blundell, Tom Goldstein, Avi Schwarzschild, Petar Veličković
The CLRS-Text algorithmic reasoning language benchmark is a textual version of the CLRS benchmark, which generates execution traces for classical algorithms from the Introduction to Algorithms textbook. CLRS-Text supports procedural generation of trace data for thirty diverse algorithmic tasks across any input distribution, and provides a standard pipeline for creating new algorithmic tasks.

The benchmark is designed to evaluate the reasoning capabilities of language models (LMs), probing how well they adapt to unfamiliar problem instances. It addresses a key difficulty with static reasoning datasets: once a dataset is fixed, models can overfit to it, creating an illusion of progress. CLRS-Text is instead a procedural dataset generator that converts CLRS's graph-based traces into textual form suitable for ingestion by LMs. This makes it possible to evaluate LMs on out-of-distribution generalisation, which is crucial for robust reasoning, to generate bespoke trace data at arbitrary input distributions, and to simplify comparisons across papers; a sketch of what such a textual trace might look like follows below.

CLRS-Text is evaluated in a zero-shot setting, with models tested on randomly sampled instances for each algorithm. The results show that randomised positional embeddings improve generalisation, but that the extrapolation performance of language models remains limited. The benchmark also emphasises evaluating base-model capabilities without relying on tool use or code interpreters. Overall, CLRS-Text provides a valuable resource for evaluating the reasoning capabilities of language models and for developing more robust, generalist models.
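To make the idea of a procedurally generated textual trace concrete, here is a minimal Python sketch that runs insertion sort on a randomly sampled input and serialises the intermediate states as a prompt/target pair. The trace format and function names are illustrative assumptions for this sketch, not the actual CLRS-Text serialisation; the real generator ships with the google-deepmind/clrs repository.

```python
import random


def insertion_sort_trace(xs):
    """Run insertion sort and record the array after each outer-loop step."""
    arr = list(xs)
    trace = [list(arr)]
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        while j >= 0 and arr[j] > key:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key
        trace.append(list(arr))
    return trace


def format_example(xs, trace):
    """Serialise an input and its trace as text.

    The prompt/target layout here is an assumption for illustration only;
    CLRS-Text defines its own serialisation of inputs, hints and outputs.
    """
    fmt = lambda a: " ".join(str(v) for v in a)
    prompt = (
        "insertion_sort:\n"
        f"key: [{fmt(xs)}]\n"
        f"initial_trace: [{fmt(trace[0])}]\n"
        "trace | pred:\n"
    )
    target = "\n".join(f"[{fmt(step)}]" for step in trace[1:])
    return prompt, target


if __name__ == "__main__":
    random.seed(0)
    # Sample an input of length 6; length and value range are free parameters,
    # which is what lets a procedural generator probe out-of-distribution sizes.
    xs = [random.randint(0, 9) for _ in range(6)]
    prompt, target = format_example(xs, insertion_sort_trace(xs))
    print(prompt + target)
```

Because the input length and distribution are parameters of the generator, the same pipeline can emit training data at one range of sizes and evaluation data at larger, unseen sizes, which is the out-of-distribution setting the benchmark targets.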