March 2021 | Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah
This survey provides a comprehensive overview of Transformer models in computer vision, highlighting their applications and advancements. Transformers, known for their ability to model long-range dependencies and support parallel processing, have been adapted for various vision tasks, including image classification, object detection, action recognition, segmentation, and video processing. The survey begins with an introduction to the fundamental concepts of self-attention, large-scale pre-training, and bidirectional feature encoding. It then covers the extensive applications of Transformers in vision, comparing the advantages and limitations of different techniques. The survey also discusses open research directions and potential future work, aiming to inspire further research in the community. Key contributions include the development of Vision Transformers (ViTs) that replace convolutions with self-attention, multi-scale and hybrid designs that combine convolutional and Transformer operations, and self-supervised learning approaches for efficient training. The survey concludes with a detailed analysis of Transformer-based object detection methods, such as DETR and its variants, which have achieved state-of-the-art performance on various benchmarks.
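To make the core operation concrete, below is a minimal sketch of the scaled dot-product self-attention that ViT-style models apply over a sequence of image-patch embeddings; the dimensions, weight matrices, and function name are illustrative assumptions, not code from the survey.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over patch embeddings (illustrative sketch).

    X:  (num_patches, d_model) patch embeddings
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices
    Returns: (num_patches, d_k) attended features
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # pairwise patch affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over all patches
    return weights @ V                                   # each patch aggregates info from every other patch

# Example: a 224x224 image split into 16x16 patches yields 196 tokens (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, d_k, num_patches = 64, 64, 196
X = rng.standard_normal((num_patches, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (196, 64)
```

Because every patch attends to every other patch, this operation captures the long-range dependencies highlighted in the survey, in contrast to the local receptive fields of convolutions.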