14 Mar 2022 | Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
Efficient Transformers: A Survey
This paper provides a comprehensive overview of recent advancements in efficient Transformer models, which aim to improve the computational and memory efficiency of the original Transformer architecture. Transformers, known for their self-attention mechanism, have become central to many domains, including natural language processing, vision, and reinforcement learning. However, the quadratic time and memory complexity of self-attention poses challenges for processing long sequences. To address this, numerous "X-former" models have been proposed, each introducing various techniques to enhance efficiency, such as sparsity, compression, and pattern-based attention.
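To make the quadratic cost concrete, here is a minimal NumPy sketch of vanilla scaled dot-product self-attention. It is not taken from the paper; the dimensions and random inputs are illustrative assumptions. The key point is that the score matrix Q @ K^T has shape (n, n), so both time and memory grow quadratically with sequence length n.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_self_attention(Q, K, V):
    """Vanilla attention. Q, K, V: (n, d) arrays. Returns an (n, d) array."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (n, n) score matrix -- O(n^2) time and memory
    weights = softmax(scores, axis=-1)  # each row is a distribution over all n positions
    return weights @ V                  # (n, d) output

# Illustrative sizes: a 1024-token sequence with 64-dimensional heads.
n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = full_self_attention(Q, K, V)      # materializes a 1024 x 1024 attention matrix
```

Doubling the sequence length quadruples the size of the attention matrix, which is precisely the bottleneck the surveyed "X-former" models try to avoid.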
The paper categorizes these models into different types, including fixed patterns, sparse models, learnable patterns, low-rank methods, kernel-based approaches, recurrence-based models, and downsampling techniques. Each category is discussed in detail, highlighting their technical innovations, primary use cases, and performance characteristics. The survey also covers specific models such as the Memory Compressed Transformer, Image Transformer, Set Transformer, Sparse Transformer, Axial Transformer, Longformer, ETC, BigBird, Routing Transformer, Reformer, and Sinkhorn Transformer; a simplified sketch of one fixed-pattern approach is given below.
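As a rough illustration of the fixed-pattern family, and not the exact algorithm of any single surveyed model, the sketch below restricts each token to attending only within its own non-overlapping block. It reuses the softmax helper and the Q, K, V arrays from the sketch above; the block size of 128 is an arbitrary choice. Cost drops from O(n^2) to O(n * b) for block size b, at the price of losing attention across block boundaries (which models such as Longformer or BigBird recover with windows, global tokens, or random links).

```python
def block_local_attention(Q, K, V, block_size=128):
    """Fixed-pattern (blockwise local) attention sketch.

    Q, K, V: (n, d) arrays. Each block of `block_size` tokens attends only
    to itself, so the largest score matrix is (block_size, block_size).
    """
    n, d = Q.shape
    out = np.empty_like(V)
    for start in range(0, n, block_size):
        sl = slice(start, start + block_size)
        scores = Q[sl] @ K[sl].T / np.sqrt(d)   # (b, b) instead of (n, n)
        out[sl] = softmax(scores, axis=-1) @ V[sl]
    return out

local_out = block_local_attention(Q, K, V, block_size=128)
```

The other surveyed families change different parts of this computation: low-rank and kernel methods approximate the softmax(QK^T) product without materializing it, while downsampling and memory-based methods shrink the number of keys and values each query sees.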
Key findings include the effectiveness of various techniques in reducing computational and memory costs while maintaining model performance. The paper emphasizes the importance of balancing efficiency with accuracy, noting that some methods explicitly aim to reduce sequence length, thereby saving computational resources. The survey also discusses the trade-offs between different approaches and highlights the growing trend of using sparse and learned patterns to achieve efficient self-attention mechanisms. Overall, the paper serves as a valuable resource for researchers and practitioners seeking to understand and implement efficient Transformer models.