Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

31 Aug 2020 | Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret
Linear transformers substantially reduce the memory and computational cost of standard transformers by using a kernel-based formulation of self-attention and exploiting the associativity of matrix products, lowering the time and memory complexity of self-attention from O(N²) to O(N), where N is the sequence length. Expressing self-attention as a linear dot-product of kernel feature maps permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks (RNNs). The same formulation supports causal masking with linear complexity and constant memory, so autoregressive inference can be performed orders of magnitude faster: linear transformers match the performance of vanilla transformers while being up to 4000x faster on autoregressive prediction of very long sequences.
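The kernel trick and the associativity argument can be seen in a few lines of code. Below is a minimal NumPy sketch, not the authors' released code, contrasting standard softmax attention with linear attention using the paper's feature map phi(x) = elu(x) + 1; the function names, toy sizes, and the single-head, unbatched setting are simplifying assumptions for illustration.

```python
import numpy as np

def elu_feature_map(x):
    """phi(x) = elu(x) + 1, a positive feature map used in place of exp(q.k)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def softmax_attention(Q, K, V):
    """Standard attention: builds an N x N matrix, so O(N^2) time and memory."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    """Kernelized attention: associativity lets us compute phi(K)^T V first,
    giving O(N * D^2) time and O(D^2) memory -- linear in the sequence length N."""
    Qf, Kf = elu_feature_map(Q), elu_feature_map(K)  # (N, D) each
    KV = Kf.T @ V                                    # (D, D_v), independent of N
    Z = Kf.sum(axis=0)                               # (D,) normalizer terms
    return (Qf @ KV) / (Qf @ Z)[:, None]

# Toy usage with illustrative sizes: N = 1024 tokens, D = 64 head dimension.
rng = np.random.default_rng(0)
N, D = 1024, 64
Q, K, V = rng.standard_normal((3, N, D))
out = linear_attention(Q, K, V)
print(out.shape)  # (1024, 64)
```

The key design point is the order of multiplication: computing phi(K)^T V before multiplying by phi(Q) avoids ever materializing the N x N attention matrix.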
The model is evaluated on synthetic tasks, image generation, and automatic speech recognition. Linear transformers converge stably, achieve lower loss than other efficient-attention methods, and require significantly less GPU memory and computation than standard transformers, while reaching performance comparable to state-of-the-art transformer architectures and remaining markedly faster during inference. Because causal linear attention can be expressed as an RNN, autoregressive inference maintains a constant-size state, which makes the model especially well suited to long sequences. The paper concludes that linear transformers offer a promising way to improve the efficiency of transformer models while maintaining their performance.
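To make the RNN connection concrete, here is a minimal sketch of causal linear attention as a recurrent update, an illustration under the same assumptions as the sketch above rather than the authors' implementation: the model keeps a constant-size state (S, z) and refreshes it once per generated token.

```python
import numpy as np

def elu_feature_map(x):
    return np.where(x > 0, x + 1.0, np.exp(x))

def recurrent_linear_attention(queries, keys, values):
    """Process tokens one by one, maintaining the running sums
    S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j)."""
    D, Dv = queries.shape[-1], values.shape[-1]
    S = np.zeros((D, Dv))     # constant-size "memory" matrix
    z = np.zeros(D)           # normalizer accumulator
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = elu_feature_map(k)
        S += np.outer(phi_k, v)                    # RNN-style state update
        z += phi_k
        phi_q = elu_feature_map(q)
        outputs.append(phi_q @ S / (phi_q @ z))    # causal attention output for this step
    return np.stack(outputs)
```

Each step touches only the (D, Dv) state, so generating token t costs the same as generating token 1; this constant per-token cost is the source of the large autoregressive speedups reported in the paper.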