23 May 2024 | Zhiyuan Li¹², Hong Liu¹, Denny Zhou³, and Tengyu Ma¹
This paper investigates the theoretical and empirical capabilities of Chain of Thought (CoT) prompting for decoder-only transformers. CoT enables the model to generate intermediate steps, which allows it to perform inherently serial computations that are otherwise not possible with standard transformers. The paper provides a theoretical analysis of the expressiveness of transformers with and without CoT, showing that CoT significantly enhances their ability to solve complex problems that require serial computation.
Theoretical analysis reveals that constant-depth transformers with constant-bit precision can only solve problems in AC⁰, a subset of TC⁰. However, with T steps of CoT, constant-depth transformers with O(log n) embedding size can solve any problem solvable by boolean circuits of size T. This suggests that CoT allows transformers to perform more serial computations, making them more expressive.
Empirically, the paper evaluates the performance of transformers on four tasks: modular addition, permutation composition, iterated squaring, and circuit value problem. Results show that CoT significantly improves accuracy on tasks that are hard for parallel computation, especially for low-depth transformers. For example, CoT enables transformers to solve permutation composition and iterated squaring tasks that are difficult for parallel computation.
The paper also defines a new complexity class, CoT[T(n), d(n), s(n), e(n)], which corresponds to a problem class solvable by constant-depth, constant-precision decoder-only transformers with T(n) steps of CoT, O(d(n)) embedding size, and floating-point numbers with O(e(n)) bits of exponents and O(s(n)) bits of significand. Theoretical results show that allowing polynomial steps of CoT makes transformers with bounded depth and precision strictly more powerful.
Overall, the paper demonstrates that CoT significantly enhances the expressiveness of decoder-only transformers, enabling them to solve complex problems that require serial computation. This has important implications for the design and application of transformers in tasks that require sequential reasoning.This paper investigates the theoretical and empirical capabilities of Chain of Thought (CoT) prompting for decoder-only transformers. CoT enables the model to generate intermediate steps, which allows it to perform inherently serial computations that are otherwise not possible with standard transformers. The paper provides a theoretical analysis of the expressiveness of transformers with and without CoT, showing that CoT significantly enhances their ability to solve complex problems that require serial computation.
Theoretical analysis reveals that constant-depth transformers with constant-bit precision can only solve problems in AC⁰, a subset of TC⁰. However, with T steps of CoT, constant-depth transformers with O(log n) embedding size can solve any problem solvable by boolean circuits of size T. This suggests that CoT allows transformers to perform more serial computations, making them more expressive.
Empirically, the paper evaluates the performance of transformers on four tasks: modular addition, permutation composition, iterated squaring, and circuit value problem. Results show that CoT significantly improves accuracy on tasks that are hard for parallel computation, especially for low-depth transformers. For example, CoT enables transformers to solve permutation composition and iterated squaring tasks that are difficult for parallel computation.
The paper also defines a new complexity class, CoT[T(n), d(n), s(n), e(n)], which corresponds to a problem class solvable by constant-depth, constant-precision decoder-only transformers with T(n) steps of CoT, O(d(n)) embedding size, and floating-point numbers with O(e(n)) bits of exponents and O(s(n)) bits of significand. Theoretical results show that allowing polynomial steps of CoT makes transformers with bounded depth and precision strictly more powerful.
Overall, the paper demonstrates that CoT significantly enhances the expressiveness of decoder-only transformers, enabling them to solve complex problems that require serial computation. This has important implications for the design and application of transformers in tasks that require sequential reasoning.