CvT: Introducing Convolutions to Vision Transformers

29 Mar 2021 | Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang
This paper introduces a new architecture, the Convolutional Vision Transformer (CvT), which improves the performance and efficiency of Vision Transformers (ViT) by integrating convolutions into the ViT design. The key innovations are a hierarchy of Transformer stages with a new convolutional token embedding, and a convolutional Transformer block that replaces the standard linear projection for attention with a convolutional projection. These modifications bring desirable properties of convolutional neural networks (CNNs) to ViT, such as shift, scale, and distortion invariance, while retaining the advantages of Transformers, including dynamic attention and global context.

CvT achieves state-of-the-art performance on ImageNet-1k with fewer parameters and lower FLOPs than other Vision Transformers and ResNets, and it maintains these gains when pretrained on larger datasets such as ImageNet-22k and fine-tuned for downstream tasks. The CvT-W24 model, pretrained on ImageNet-22k, reaches a top-1 accuracy of 87.7% on the ImageNet-1k validation set. In addition, the positional encoding, a crucial component of existing Vision Transformers, can be safely removed in CvT, simplifying the design for higher-resolution vision tasks.

The design allows for efficient computation and memory usage, and CvT outperforms both CNN-based and Transformer-based models in accuracy and efficiency. Evaluated on a range of downstream tasks, the architecture shows strong transfer performance, demonstrating its effectiveness for image classification and transfer learning while remaining lightweight and efficient. The code for the CvT model is available at https://github.com/leoxiaobin/CvT.
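To make the two convolutional components concrete, below is a minimal PyTorch sketch of (1) a convolutional token embedding, which applies an overlapping strided convolution and flattens the result into a token sequence, and (2) a convolutional projection, which computes Q/K/V from a depthwise-separable convolution over the 2D token grid instead of a plain linear layer. The module names, kernel sizes, strides, and normalization choices here are illustrative assumptions for one stage, not the official implementation from the linked repository.

```python
import torch
import torch.nn as nn


class ConvTokenEmbedding(nn.Module):
    """Overlapping strided convolution that turns an image or feature map into tokens."""
    def __init__(self, in_ch, embed_dim, kernel_size=7, stride=4, padding=2):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size, stride, padding)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, D, H', W')
        B, D, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H'*W', D)
        return self.norm(tokens), (H, W)


class ConvProjection(nn.Module):
    """Depthwise-separable conv used in place of a linear Q/K/V projection."""
    def __init__(self, dim, kernel_size=3, stride=1):
        super().__init__()
        pad = kernel_size // 2
        self.dw = nn.Conv2d(dim, dim, kernel_size, stride, pad, groups=dim, bias=False)
        self.bn = nn.BatchNorm2d(dim)
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, tokens, hw):              # tokens: (B, N, D)
        B, N, D = tokens.shape
        H, W = hw
        x = tokens.transpose(1, 2).reshape(B, D, H, W)
        x = self.pw(self.bn(self.dw(x)))        # local spatial mixing before attention
        return x.flatten(2).transpose(1, 2)     # back to (B, N', D)


# Hypothetical usage for a first stage on a 224x224 image:
emb = ConvTokenEmbedding(in_ch=3, embed_dim=64)
tokens, hw = emb(torch.randn(1, 3, 224, 224))   # 56*56 = 3136 tokens of dim 64
q = ConvProjection(64, stride=1)(tokens, hw)    # query path keeps all tokens
kv = ConvProjection(64, stride=2)(tokens, hw)   # key/value path downsampled to 28*28
```

Under these assumptions, a stride of 2 on the key/value path reduces the number of tokens attended over by a factor of four, which is one way the convolutional projection cuts the cost of self-attention; and because the strided, overlapping convolutions already encode local spatial structure, no explicit positional encoding needs to be added.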