February 28, 2024 | Binghui Peng, Srini Narayanan, Christos Papadimitriou
The paper examines the limitations of the Transformer architecture, in particular its inability to perform function composition, an operation central to language understanding. Using a communication complexity argument, the authors prove that a single Transformer layer cannot reliably compute the composition of two functions once the functions' domain is large relative to the model's size (its embedding dimension, numerical precision, and number of attention heads). They further show that even for modest domains, Transformers struggle with this task, which underlies everyday reasoning such as answering genealogy questions and carrying out simple logical deductions. The paper also shows that, assuming well-established conjectures in computational complexity, Transformers cannot solve certain core computational problems that lie at the heart of compositional tasks.
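To make the composition task concrete, here is a minimal Python sketch (an illustration, not the paper's construction; the table names and helper below are hypothetical) of the kind of two-hop query involved: the prompt lists two lookup tables, and answering requires chaining them.

```python
# Illustrative sketch (not from the paper): the two-hop function-composition task.
# Given lookup tables for f (child -> father) and g (person -> mother), the model
# must answer a query of the form g(f(x)) -- "the mother of x's father".

father = {"John": "Bill", "Mary": "Tom"}   # hypothetical table for f
mother = {"Bill": "Ann", "Tom": "Sue"}     # hypothetical table for g

def paternal_grandmother(x):
    """Ground truth: chain the two lookups, g(f(x))."""
    return mother[father[x]]

# The same task rendered as a prompt, the way it would be posed to a model.
prompt = (
    "; ".join(f"{c}'s father is {p}" for c, p in father.items()) + "; "
    + "; ".join(f"{c}'s mother is {p}" for c, p in mother.items())
    + ". Question: who is the mother of John's father?"
)

print(paternal_grandmother("John"))  # Ann -- requires composing two lookups
print(prompt)
```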
The authors demonstrate that Transformers have difficulty with function composition, a fundamental operation in both reasoning over relational data and language understanding. They argue that this inability is rooted in the softmax attention computation, which bounds the amount of non-local information a single position can gather from the rest of the prompt.
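The toy NumPy sketch below shows what a single softmax attention layer computes; the dimensions and weight matrices are arbitrary placeholders, and the closing comment paraphrases the information-theoretic point rather than reproducing the paper's exact bound.

```python
import numpy as np

# Minimal single-head softmax attention: each output row is a convex
# combination of the value vectors, weighted by the softmax of the scores.

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_layer(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    A = softmax(scores)                      # each row sums to 1
    return A @ V                             # weighted averages of the values

rng = np.random.default_rng(0)
n, d = 8, 16                                 # toy sequence length, embedding width
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = attention_layer(X, Wq, Wk, Wv)
# With d-dimensional embeddings at p bits of precision, the answer position
# receives only on the order of d * p bits about the rest of the prompt --
# the kind of quantity the paper's communication-complexity argument bounds.
print(out.shape)  # (8, 16)
```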
The paper also examines the limits of Chain of Thought (CoT) prompting, showing that even with CoT, Transformers need a large number of generation steps to solve complex compositional tasks: the authors prove that the number of CoT steps required grows with the depth of the composition. More broadly, Transformers struggle with tasks that demand the sequential composition of elementary operations, such as multi-step arithmetic and logical puzzles, and this difficulty increases with the complexity of the task.
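A small sketch of this point (again illustrative, with made-up helper names, not the paper's construction): in an iterated-composition task where each chain-of-thought step resolves one function application, the number of steps scales linearly with the composition depth k.

```python
import random

# Illustrative iterated-composition task: apply k random functions in sequence.
# A chain of thought that resolves one application per step needs k steps.

def make_task(n, k, seed=0):
    """Random functions f_1, ..., f_k on {0, ..., n-1} and a starting point x."""
    rng = random.Random(seed)
    fs = [[rng.randrange(n) for _ in range(n)] for _ in range(k)]
    return fs, rng.randrange(n)

def chain_of_thought(fs, x):
    """Resolve the composition one application at a time, recording each step."""
    steps = []
    for i, f in enumerate(fs, start=1):
        x = f[x]
        steps.append(f"step {i}: current value -> {x}")
    return x, steps

fs, x0 = make_task(n=50, k=6)
answer, steps = chain_of_thought(fs, x0)
print(len(steps), "CoT steps for composition depth", len(fs))  # grows linearly with k
```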
The authors also place Transformers within computational complexity theory, showing that their computation falls into a weak complexity class, logspace-uniform TC⁰. This implies tight limits on the memory and computational power available to the architecture, which restricts its ability to handle complex tasks. The paper concludes that Transformers are fundamentally limited in their ability to perform certain compositional tasks, and that these limitations stem from the architecture's design and, in particular, from the nature of the softmax computation.
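For context, logspace-uniform TC⁰ sits at the bottom of the standard containment chain of complexity classes (a textbook fact, not a result of the paper):

\[
\mathsf{TC}^0 \;\subseteq\; \mathsf{NC}^1 \;\subseteq\; \mathsf{L} \;\subseteq\; \mathsf{NL} \;\subseteq\; \mathsf{P}
\]

So, under widely believed conjectures such as \(\mathsf{L} \neq \mathsf{NL}\) or \(\mathsf{NL} \neq \mathsf{P}\), problems complete for the higher classes, such as graph reachability or circuit evaluation, cannot be solved by any computation confined to TC⁰; this is the sense in which membership in such a weak class limits what Transformers can do.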