10 Jul 2023 | Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, and Dacheng Tao, Fellow, IEEE
This survey explores the application of transformers to computer vision, highlighting their performance and potential. Transformers, originally developed for natural language processing, are now applied to tasks such as image classification, object detection, segmentation, and video processing, where they match or outperform traditional CNNs and RNNs on many benchmarks while requiring fewer vision-specific inductive biases. The paper categorizes vision transformers by task into backbone networks, high/mid-level vision, low-level vision, and video processing. It also discusses efficient transformer methods for real-world applications and the self-attention mechanism at the core of the architecture, and outlines open challenges and future research directions.
The paper reviews a range of transformer-based models, including ViT, DeiT, and DETR, comparing their performance, efficiency, and applications across different vision tasks. It also addresses practical issues such as training stability, data augmentation, and self-supervised learning for vision transformers. The survey highlights the potential of transformers in computer vision and suggests further research into improving their effectiveness and efficiency.
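The self-attention mechanism the survey identifies as central to transformers can be illustrated with a minimal sketch. The function below is a generic, dependency-free implementation of scaled dot-product attention, softmax(QK^T / sqrt(d)) V, not code from the surveyed models; the tiny example tokens are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V.

    Q, K, V are lists of d-dimensional vectors (lists of floats).
    Each output row is a convex combination of the rows of V,
    weighted by the similarity of the query to each key.
    """
    d = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Hypothetical example: three 2-d tokens attending to themselves,
# as in a single self-attention head (Q = K = V).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(tokens, tokens, tokens)
```

In vision transformers such as ViT, the tokens would be linear projections of image patches rather than raw vectors, and Q, K, V are produced by learned projection matrices; this sketch omits those projections to show only the attention step itself.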