15 Jun 2021 | TIANYANG LIN, YUXIN WANG, XIANGYANG LIU, and XIPENG QIU*
This survey provides a comprehensive review of Transformer variants, collectively known as X-formers, which have been proposed to improve the vanilla Transformer architecture from different perspectives. The authors first introduce the vanilla Transformer, including its architecture, key components, and usage. They then propose a new taxonomy of X-formers based on three dimensions: architectural modifications, pre-training methods, and applications. The survey focuses on architectural modifications, discussing attention-related variants such as linearized attention, query prototyping, memory compression, low-rank self-attention, and attention with prior, along with other related techniques. The authors also briefly review pre-trained models and applications of Transformers in various fields. The goal is to provide a systematic and comprehensive overview of the diverse X-formers, highlighting their contributions and potential future research directions.
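Since the taxonomy centers on modifications to the attention module, a minimal sketch of the vanilla scaled dot-product attention that these X-formers modify may help fix the notation. This is an illustrative NumPy implementation, not code from the survey; the function name and shapes are assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Vanilla attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    The O(n^2) score matrix computed here is what linearized-attention
    and memory-compression variants aim to reduce.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n) pairwise scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)    # block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                           # (n, d_v)

# Example usage with random inputs
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 64)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (8, 64)
```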