March 2021 | Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah
This survey provides a comprehensive overview of the application of Transformer models in computer vision. Transformers, originally developed for natural language processing, have shown great potential in vision tasks thanks to their ability to model long-range dependencies and to support parallel processing. Unlike convolutional networks, Transformers require minimal inductive biases and are naturally suited to set functions. They can process multiple modalities and scale well to large networks and datasets.

The survey covers a wide range of applications of Transformers in vision, including recognition tasks, generative modeling, multi-modal tasks, video processing, low-level vision, and 3D analysis, and compares the advantages and limitations of different techniques in terms of architectural design and experimental value. It also discusses open research directions and possible future work. The paper begins by introducing the fundamental concepts behind the success of Transformers: self-attention, large-scale pre-training, and bidirectional feature encoding.
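Since self-attention is the core building block named above, a minimal sketch may help make it concrete. The snippet below is an illustrative NumPy implementation of scaled dot-product self-attention (the mechanism from the original Transformer); the shapes, variable names, and toy sizes are assumptions for the example, not code from the survey.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one token sequence.

    x:             (n, d_model) input token embeddings
    w_q, w_k, w_v: (d_model, d_k) learned projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # project tokens to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (n, n) pairwise similarities
    weights = softmax(scores, axis=-1)       # each row is a distribution over tokens
    return weights @ v                       # attention-weighted sum of values

# Toy usage: 4 tokens, model width 8 (illustrative sizes only).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (4, 8)
```

Multi-head attention, which figures in the categorization below, simply runs several such projections in parallel on the same tokens and concatenates the per-head outputs.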
It then elaborates on the specifics of recent vision Transformers, drawing parallels between NLP and vision Transformers to highlight novelties and domain-specific insights. The survey revisits the foundations of the Transformer model, including self-attention, supervised pre-training, and the overall architecture, and explores the use of self-attention within CNNs as well as the role of bidirectional representations. Vision models with self-attention are grouped into two categories: those using single-head self-attention and those employing multi-head self-attention based Transformer modules.

The survey then discusses various Transformer-based vision architectures, including uniform-scale, multi-scale, and hybrid designs. It also covers self-supervised vision Transformers, which extend the self-supervised learning paradigm that has already proven highly successful for CNN-based vision tasks. Finally, the survey examines the application of Transformers to object detection, covering both detection Transformers with CNN backbones and purely Transformer-based designs, highlighting the strengths and limitations of different approaches and offering insights into future research directions.
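To connect these architectures back to images: uniform-scale designs in the ViT style first convert an image into a token sequence by splitting it into fixed-size, non-overlapping patches and linearly projecting each one. Below is a rough NumPy sketch of that patch-embedding step; the function name, sizes, and projection setup are illustrative assumptions, not the survey's code.

```python
import numpy as np

def patch_embed(image, patch_size, w_proj):
    """Split an image into non-overlapping patches and embed each patch.

    image:  (h, w, c) array with h and w divisible by patch_size
    w_proj: (patch_size * patch_size * c, d_model) projection matrix
    Returns a (num_patches, d_model) token sequence.
    """
    h, w, c = image.shape
    p = patch_size
    # Carve the image into a grid of p x p patches, then flatten each patch.
    grid = image.reshape(h // p, p, w // p, p, c)
    patches = grid.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return patches @ w_proj  # one embedding vector per patch

# Toy usage: a 32x32 RGB image with 8x8 patches -> 16 tokens of width 64.
rng = np.random.default_rng(1)
img = rng.normal(size=(32, 32, 3))
w_proj = rng.normal(size=(8 * 8 * 3, 64))
print(patch_embed(img, 8, w_proj).shape)  # (16, 64)
```

The resulting token sequence (plus position embeddings and, in ViT, a class token) is what the stacked self-attention blocks sketched earlier operate on.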