31 Aug 2020 | Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret
Transformers, while achieving remarkable performance in various tasks, suffer from quadratic complexity with respect to the input sequence length, making them slow for very long sequences. To address this, the paper introduces *linear transformers*, which express self-attention as a linear dot-product of kernel feature maps, reducing complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$, where $N$ is the sequence length. This formulation allows for an iterative implementation that significantly accelerates autoregressive transformers and reveals their relationship to recurrent neural networks (RNNs). The linear transformers achieve similar performance to vanilla transformers but are up to 4000 times faster on autoregressive prediction of very long sequences. The paper also demonstrates that the linear transformer can be used for image generation and automatic speech recognition, achieving competitive performance with significantly reduced computational and memory requirements.
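
To make the mechanism concrete, here is a minimal NumPy sketch (not the authors' PyTorch code) of linear attention with the feature map $\phi(x) = \mathrm{elu}(x) + 1$ used in the paper: the softmax kernel is replaced by $\phi(q)^\top \phi(k)$, so the key-value summary can be computed once in $\mathcal{O}(N)$, and the causal case reduces to an RNN-like running sum. Function names and shapes are illustrative only.

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1: positive feature map so the normaliser is well defined
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Non-causal case: softmax(QK^T)V is approximated by
    phi(Q) (phi(K)^T V), normalised by phi(Q) (phi(K)^T 1). O(N) in sequence length."""
    Qf, Kf = elu_feature_map(Q), elu_feature_map(K)   # (N, d)
    KV = Kf.T @ V                                     # (d, d_v), computed once
    Z = Qf @ Kf.sum(axis=0)                           # (N,) normaliser
    return (Qf @ KV) / Z[:, None]

def causal_linear_attention(Q, K, V):
    """Autoregressive case: the running sums S = sum_j phi(k_j) v_j^T and
    z = sum_j phi(k_j) act like an RNN hidden state updated one step at a time."""
    Qf, Kf = elu_feature_map(Q), elu_feature_map(K)
    N, d = Qf.shape
    S = np.zeros((d, V.shape[1]))
    z = np.zeros(d)
    out = np.zeros_like(V, dtype=float)
    for i in range(N):
        S += np.outer(Kf[i], V[i])
        z += Kf[i]
        out[i] = (Qf[i] @ S) / (Qf[i] @ z)
    return out

# Tiny usage example with random inputs
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
print(linear_attention(Q, K, V).shape)          # (6, 4)
print(causal_linear_attention(Q, K, V).shape)   # (6, 4)
```

Because the per-step state $(S, z)$ has fixed size independent of $N$, autoregressive generation needs constant memory per step, which is the source of the large speedups reported for very long sequences.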