14 Jun 2020 | Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma
The paper "Linformer: Self-Attention with Linear Complexity" by Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma from Facebook AI introduces a novel approach to optimizing the self-attention mechanism in Transformers, a key component of large language models. The standard self-attention mechanism has quadratic complexity in time and space, making it computationally expensive for long sequences. The authors demonstrate that the self-attention matrix can be approximated by a low-rank matrix, reducing the complexity to linear time and space. This is achieved by decomposing the scaled dot-product attention into multiple smaller attentions through linear projections, forming a low-rank factorization of the original attention matrix.

The resulting model, called Linformer, performs on par with standard Transformers while being significantly more memory- and time-efficient. The paper includes theoretical analysis and empirical results showing that Linformer matches or even slightly outperforms standard Transformers on various natural language processing tasks, including pretraining and downstream tasks, while achieving substantial speedups in training and inference.
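To make the low-rank idea concrete, here is a minimal NumPy sketch of Linformer-style attention. Learned projection matrices compress the keys and values along the sequence dimension from length n down to a small fixed k, so the attention map becomes n × k instead of n × n. The matrices E and F below are random placeholders (in the actual model they are learned parameters), and the single-head, unbatched shapes are simplifications for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Linformer-style attention for one head.

    Q, K, V: (n, d) query/key/value matrices.
    E, F:    (k, n) projections that compress keys and values
             along the sequence dimension, so the softmax is
             taken over an (n, k) score matrix instead of (n, n).
    """
    d = Q.shape[-1]
    K_proj = E @ K                        # (k, d) compressed keys
    V_proj = F @ V                        # (k, d) compressed values
    scores = Q @ K_proj.T / np.sqrt(d)    # (n, k) instead of (n, n)
    return softmax(scores) @ V_proj       # (n, d) output, linear in n

# Toy dimensions: sequence length n, projected length k, head size d
n, k, d = 512, 64, 32
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E = rng.standard_normal((k, n)) / np.sqrt(n)  # placeholder for a learned projection
F = rng.standard_normal((k, n)) / np.sqrt(n)  # placeholder for a learned projection
out = linformer_attention(Q, K, V, E, F)
```

Because k is a constant independent of sequence length, both the score matrix and the softmax cost O(n·k) = O(n), which is the source of the linear time and memory claim.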