15 Jun 2021 | TIANYANG LIN, YUXIN WANG, XIANGYANG LIU, and XIPENG QIU*
This survey provides a comprehensive review of Transformer variants, collectively known as X-formers. Transformers have achieved great success in fields such as natural language processing, computer vision, and audio processing, yet a systematic review of X-formers has been missing. The survey introduces a new taxonomy of X-formers from three perspectives: architectural modifications, pre-training, and applications, and discusses variants that improve model efficiency, generalization, and adaptation. It also covers the architecture of the vanilla Transformer, key components such as the attention modules and the position-wise FFN, and the computational complexity of these components, and compares the Transformer with other network types, highlighting the advantages of self-attention. Attention variants are categorized into sparse attention, linearized attention, query prototyping, memory compression, low-rank self-attention, and attention with prior; improved multi-head mechanisms and their applications are also discussed. The survey concludes with a discussion of future research directions.
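For reference, the vanilla components mentioned above (single-head scaled dot-product attention and the position-wise FFN) admit a compact sketch. The following Python/NumPy snippet is purely illustrative and not taken from the survey; the function names and toy dimensions are assumptions. It also makes the quadratic cost in sequence length visible in the (n_q, n_k) score matrix, which is what many of the efficiency-oriented X-formers aim to reduce.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Vanilla attention: softmax(Q K^T / sqrt(d_k)) V.
    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    Memory and time scale with n_q * n_k, i.e. quadratically
    in sequence length for self-attention (n_q == n_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) attention logits
    weights = softmax(scores, axis=-1)   # row-wise attention distribution
    return weights @ V                   # (n_q, d_v)

def position_wise_ffn(X, W1, b1, W2, b2):
    """Position-wise FFN applied independently to each token:
    FFN(x) = max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

# Toy usage with assumed sizes: 4 tokens, model dim 8, FFN inner dim 32.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
attn_out = scaled_dot_product_attention(X, X, X)   # self-attention
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
out = position_wise_ffn(attn_out, W1, b1, W2, b2)
print(out.shape)  # (4, 8)
```

The sketch omits multi-head projections, residual connections, and layer normalization; it is only meant to ground the terms "attention module" and "position-wise FFN" used in the summary.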