Efficient Transformers: A Survey


14 Mar 2022 | Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
This article presents a comprehensive survey of efficient Transformer models, which aim to improve the computational and memory efficiency of the original Transformer architecture. The original Transformer, introduced in 2017, relies on self-attention with quadratic time and memory complexity in the sequence length, which becomes prohibitive for long sequences. To address this, researchers have developed a range of efficient Transformer variants that reduce the cost of self-attention while largely preserving model quality.

The survey organizes these models into several families: fixed attention patterns, combinations of patterns, learnable patterns, neural memory, low-rank methods, kernel-based approximations, recurrence, downsampling, and sparse models. Each family reduces the computational and memory cost of self-attention in a different way, for example by restricting the attention span to local neighborhoods, applying sparse attention patterns, or using low-rank approximations of the attention matrix.
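To make the complexity contrast concrete, the following is a minimal NumPy sketch (an illustration for this summary, not code from the survey) comparing vanilla scaled dot-product attention with a simple fixed-pattern local variant of the kind used by local-attention models; the function names and the `window` parameter are hypothetical.

```python
import numpy as np

def full_attention(Q, K, V):
    """Standard scaled dot-product attention: the score matrix is n x n,
    so time and memory grow quadratically with sequence length n."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (n, d_v)

def local_attention(Q, K, V, window=4):
    """Fixed-pattern local attention: each query attends only to a
    neighborhood of `window` positions on either side, so per-query
    cost is O(window) instead of O(n)."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[-1]))
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)        # (hi - lo,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ V[lo:hi]
    return out

# Toy usage: 16 tokens, model dimension 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((16, 8))
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
print(full_attention(Q, K, V).shape)   # (16, 8)
print(local_attention(Q, K, V).shape)  # (16, 8)
```

The local variant trades global context for a cost that scales linearly in sequence length; most of the surveyed models combine such restricted patterns with other mechanisms (global tokens, low-rank projections, kernels) to recover long-range information.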
The article discusses several key efficient Transformer models, including the Memory Compressed Transformer, Image Transformer, Set Transformer, Sparse Transformer, Axial Transformer, Longformer, ETC, BigBird, and the Routing Transformer. These models illustrate complementary techniques for improving efficiency, such as local attention, sparse attention, and global memory tokens. The survey also highlights the trade-offs between efficiency and performance, noting that models which reduce computational cost may introduce additional implementation complexity or limitations in certain applications.

Overall, the survey provides a detailed overview of the latest advancements in efficient Transformer models, emphasizing their importance in enabling Transformers to handle long sequences and large inputs. The article concludes by discussing open challenges and future directions in this area of research.