Do Efficient Transformers Really Save Computation?

21 Feb 2024 | Kai Yang, Jan Ackermann, Zhenyu He, Guhao Feng, Bohang Zhang, Yunzhen Feng, Qiwei Ye, Di He, Liwei Wang
This paper investigates whether efficient Transformers, such as the Sparse Transformer and the Linear Transformer, truly reduce computational cost compared to the standard Transformer. Although many efficient Transformers have been proposed, they lack theoretical guarantees of efficiency. The authors analyze the reasoning capabilities of these models by formulating reasoning as a dynamic programming (DP) problem. They find that while these models are expressive enough to solve general DP tasks, their computational complexity still scales with the problem size, much as for standard Transformers. For certain DP problems with strong locality, however, efficient Transformers can genuinely be more efficient.

The paper introduces the notion of locality: each step in the reasoning process depends only on a bounded window of recent steps. For tasks with this property, such as arithmetic expression evaluation, the Sparse Transformer achieves high accuracy with a constant hidden dimension, while the Linear Transformer still requires a larger one. Experiments on Longest Increasing Subsequence (LIS), Edit Distance (ED), and Arithmetic show that efficient Transformers need larger hidden dimensions as the problem size grows, yet they can still outperform standard Transformers on tasks with strong locality.

Theoretical analysis shows that for general DP problems, efficient Transformers require a hidden dimension that scales with the square root of the problem size; when locality is present, the required hidden dimension can be significantly reduced. The results suggest that efficient Transformers are not always more efficient, but can be effective in specific scenarios, particularly those with strong locality.
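To make the locality notion concrete, here is a minimal sketch (not code from the paper) contrasting two of the DP tasks it discusses: in LIS, each DP step must look back over the entire history of earlier steps, while a recurrence like the classic minimum-cost stair climb (used here purely as an illustrative local DP, not a task from the paper) depends only on a fixed window of the two most recent steps.

```python
def lis_length(nums):
    """Longest Increasing Subsequence: dp[i] depends on ALL earlier
    steps, so each reasoning step needs the full history (no locality)."""
    dp = [1] * len(nums)
    for i in range(len(nums)):
        for j in range(i):          # looks back over every prior step
            if nums[j] < nums[i]:
                dp[i] = max(dp[i], dp[j] + 1)
    return max(dp, default=0)

def min_cost_climb(cost):
    """A local DP: each step depends only on the two most recent steps,
    the kind of strong locality under which efficient Transformers
    can save computation."""
    a, b = 0, 0                     # costs to reach steps i-2 and i-1
    for c in cost:
        a, b = b, min(a, b) + c     # a window of size 2 suffices
    return min(a, b)
```

For example, `lis_length([10, 9, 2, 5, 3, 7, 101, 18])` returns 4, and `min_cost_climb([10, 15, 20])` returns 15. The contrast mirrors the paper's distinction: a model computing the second recurrence only ever needs a constant amount of recent context per step.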
The paper highlights the importance of understanding the theoretical limitations and practical strengths of efficient Transformers.
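As a hedged sketch of why linear attention changes the cost profile at all: standard attention materializes an n-by-n score matrix (time growing as n²·d), whereas kernelized linear attention in the style the paper studies reorders the computation through a d-by-d summary (time growing as n·d²). The feature map `phi` below is an illustrative positive kernel, not the paper's exact construction.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: forms an n x n score matrix -> O(n^2 * d) time.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1.0):
    # Linear attention: a positive feature map phi lets us associate the
    # product as phi(Q) @ (phi(K).T @ V), avoiding any n x n matrix.
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                     # d x d summary of all keys/values
    Z = Qp @ Kp.sum(axis=0)           # per-query normalization terms
    return (Qp @ KV) / Z[:, None]
```

The O(n·d²) form explains the paper's finding: the d-by-d summary must carry everything the model later attends to, so without locality the hidden dimension d must grow with the problem size for the summary to be informative enough.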