Transformers, parallel computation, and logarithmic depth

February 15, 2024 | Clayton Sanford, Daniel Hsu, and Matus Telgarsky

This paper explores the computational capabilities of transformers, highlighting their efficiency in solving basic computational tasks through parallel computation. The authors demonstrate that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation (MPC). This establishes that logarithmic depth suffices for transformers to solve tasks that cannot be efficiently solved by other neural sequence models or by sub-quadratic approximations of attention. The key contribution is the identification of parallelism as a distinguishing property of transformers.

Concretely, the paper advances the understanding of transformers' representational capabilities by showing that any MPC protocol can be implemented by a transformer whose depth is linear in the number of communication rounds, and that any depth-$L$ transformer can be simulated by an MPC protocol with $O(L)$ rounds. This correspondence yields tight upper and lower bounds on the depth of bounded-size transformers for computing connected components in graphs.

Additionally, the authors study a sequential modeling task called the $k$-hop induction heads task, which generalizes the standard induction heads task. They show that this task can be solved by a transformer of depth $O(\log k)$, whereas other architectures, such as recurrent neural networks and graph neural networks, require substantially more depth or parameters to do so. Empirical results support these theoretical findings, showing that deeper transformers are better suited to this task.
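
To make the $k$-hop task concrete, here is a minimal Python sketch of one natural formalization: a single "hop" jumps to the position immediately after the most recent earlier occurrence of the current token, and the $k$-hop answer is the token reached after $k$ such hops. The function names and input/output conventions below are illustrative assumptions, not the paper's exact encoding.

```python
def one_hop(seq, i):
    # Position just after the most recent earlier occurrence of seq[i], or None.
    for j in range(i - 1, -1, -1):
        if seq[j] == seq[i]:
            return j + 1
    return None

def k_hop_label(seq, i, k):
    # Iterate one_hop k times starting from position i; return the token reached.
    pos = i
    for _ in range(k):
        if pos is None:
            return None
        pos = one_hop(seq, pos)
    return None if pos is None else seq[pos]

toks = list("acbdacbd")
print(k_hop_label(toks, 7, 1))  # 'a' -- standard induction heads: the token after the previous 'd'
print(k_hop_label(toks, 7, 2))  # 'c' -- one further hop, resolved from that 'a'
```

With $k = 1$ this reduces to the usual induction heads task; larger $k$ chains together more of these lookups, which is the sense in which the task generalizes induction heads.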

The paper also provides lower bounds on the parameter complexity of graph neural networks and recurrent neural architectures for solving graph connectivity and the $k$-hop task, demonstrating that these architectures cannot solve the tasks as efficiently as transformers. Furthermore, it shows that sub-quadratic attention mechanisms, as well as single-layer transformers augmented with chain-of-thought tokens, lose this efficiency at parallel computation under logarithmic-depth scaling. Overall, the paper underscores the importance of parallelism in transformers and provides theoretical and empirical evidence for their superior computational capabilities.
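
For reference, the graph connectivity / connected components problem behind the depth bounds above amounts to the following computation. The paper's encoding of graphs as token sequences and the precise problem variant are not reproduced here, so this is only an assumed reference implementation of the target function, not a claim about how a transformer computes it.

```python
def connected_components(num_vertices, edges):
    # Union-find with path halving: label each vertex by a component representative.
    parent = list(range(num_vertices))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv

    return [find(u) for u in range(num_vertices)]

# Example: vertices {0, 1, 2} form one component and {3, 4} another.
labels = connected_components(5, [(0, 1), (1, 2), (3, 4)])
print(labels[0] == labels[2], labels[0] == labels[3])  # True False
```

Connectivity is a well-studied problem in the MPC model, which is what makes it a natural task for transferring round complexity bounds from MPC to depth bounds for transformers.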