21 Feb 2024 | Kai Yang, Jan Ackermann, Zhenyu He, Guhao Feng, Bohang Zhang, Yunzhen Feng, Qiwei Ye, Di He, Liwei Wang
The paper "Do Efficient Transformers Really Save Computation?" explores the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer, in solving Dynamic Programming (DP) problems. The authors aim to understand whether these efficient models can reduce computational complexity compared to standard Transformers. They model reasoning tasks, which are often solved using Chain-of-Thought (CoT) prompts, as Dynamic Programming problems and analyze the expressive power and efficiency of efficient Transformers in this context.
Key findings include:
1. **Expressiveness**: Both the Sparse Transformer and the Linear Transformer are expressive enough to solve general DP tasks, but only if their model size is allowed to grow with the problem size.
2. **Complexity**: Unlike standard Transformers, whose hidden dimension can stay constant as problems grow, efficient Transformers need a model size that scales with the problem. Specifically, for both architectures the hidden dimension must grow as \(\Omega(\sqrt{L})\), where \(L\) is the length of the CoT sequence, to solve general DP tasks.
3. **Efficiency in Specific Tasks**: The authors identify a class of DP problems with high locality, where each step depends only on a few recent steps, for which efficient Transformers genuinely beat standard Transformers. Arithmetic expression evaluation is one such task, and on it efficient Transformers achieve lower overall complexity (a toy illustration of locality follows this list).
4. **Experiments**: Experiments on the Longest Increasing Subsequence (LIS), Edit Distance (ED), and arithmetic evaluation confirm the theory: efficient Transformers generally need larger hidden dimensions than standard ones, and the required dimension grows with problem size (the LIS recurrence is sketched below).
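To give intuition for locality, here is a toy step-by-step evaluator in which each CoT line is produced from the previous line alone, so a model only ever needs to attend a short distance back. The regex-based reducer and integer-only arithmetic are simplifying assumptions for illustration, not the paper's formal task encoding or its definition of locality.

```python
import re

def reduce_once(expr):
    """One CoT step: evaluate the leftmost highest-precedence operation.
    Each rewritten line depends only on the previous line, giving the
    high-locality structure the paper exploits."""
    for pattern in (r"(\d+)\s*([*/])\s*(\d+)", r"(\d+)\s*([+-])\s*(\d+)"):
        m = re.search(pattern, expr)
        if m:
            a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
            val = {"*": a * b, "/": a // b, "+": a + b, "-": a - b}[op]
            return expr[:m.start()] + str(val) + expr[m.end():]
    return expr  # nothing left to reduce

expr = "3 + 4 * 2 - 5"
while True:
    nxt = reduce_once(expr)
    print(expr)                 # prints: 3 + 4 * 2 - 5 / 3 + 8 - 5 / 11 - 5 / 6
    if nxt == expr:
        break
    expr = nxt
```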
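By contrast, the LIS task from the experiments is a classic non-local DP: each state depends on all earlier states, which is exactly the regime where the \(\Omega(\sqrt{L})\) lower bound bites. The standard \(O(n^2)\) recurrence is sketched below; this illustrates the task itself, not the paper's Transformer encoding of it.

```python
def lis_length(a):
    """Longest Increasing Subsequence via the classic O(n^2) DP.
    dp[i] = length of the longest increasing subsequence ending at a[i];
    note that dp[i] depends on every earlier dp[j], so the recurrence
    has no locality to exploit."""
    if not a:
        return 0
    dp = [1] * len(a)
    for i in range(1, len(a)):
        for j in range(i):
            if a[j] < a[i]:
                dp[i] = max(dp[i], dp[j] + 1)
    return max(dp)

print(lis_length([10, 9, 2, 5, 3, 7, 101, 18]))  # -> 4, e.g. [2, 3, 7, 18]
```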
The paper concludes that while efficient Transformers can save computation on certain specific tasks, notably those with high locality, they do not consistently reduce overall computational complexity compared to standard Transformers. This highlights the need for further research to identify the conditions under which efficient Transformers are truly efficient.