CvT: Introducing Convolutions to Vision Transformers

29 Mar 2021 | Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang
This paper introduces a new architecture, the Convolutional Vision Transformer (CvT), which improves the performance and efficiency of Vision Transformers (ViT) by integrating convolutions into the ViT design. The key innovations are a hierarchy of Transformer stages with a new convolutional token embedding, and a convolutional Transformer block that replaces the standard linear projection for attention with a convolutional projection. These modifications bring desirable properties of convolutional neural networks (CNNs) to ViT, such as shift, scale, and distortion invariance, while retaining the advantages of Transformers, including dynamic attention and global context.

CvT achieves state-of-the-art performance on ImageNet-1k with fewer parameters and lower FLOPs than other Vision Transformers and ResNets, and it maintains these gains when pretrained on larger datasets such as ImageNet-22k and fine-tuned for downstream tasks. The CvT-W24 model, pretrained on ImageNet-22k, reaches a top-1 accuracy of 87.7% on the ImageNet-1k validation set. In addition, the positional encoding, a crucial component of existing Vision Transformers, can be safely removed in CvT, simplifying the design for higher-resolution vision tasks.

The design allows for efficient computation and memory usage, and CvT outperforms both CNN-based and Transformer-based models in accuracy and efficiency. Evaluated on a range of downstream tasks, the architecture shows strong transfer performance, demonstrating its effectiveness for image classification and transfer learning while remaining lightweight and efficient. The code for the CvT model is available at https://github.com/leoxiaobin/CvT.
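To make the two convolutional components concrete, below is a minimal PyTorch sketch of (1) a convolutional token embedding, which applies an overlapping strided convolution and flattens the result into a token sequence, and (2) a convolutional projection, which computes Q/K/V from a depthwise-separable convolution over the 2D token grid instead of a plain linear layer. The module names, kernel sizes, strides, and normalization choices here are illustrative assumptions for one stage, not the official implementation from the linked repository.

```python
import torch
import torch.nn as nn


class ConvTokenEmbedding(nn.Module):
    """Overlapping strided convolution that turns an image or feature map into tokens."""
    def __init__(self, in_ch, embed_dim, kernel_size=7, stride=4, padding=2):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size, stride, padding)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, D, H', W')
        B, D, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H'*W', D)
        return self.norm(tokens), (H, W)


class ConvProjection(nn.Module):
    """Depthwise-separable conv used in place of a linear Q/K/V projection."""
    def __init__(self, dim, kernel_size=3, stride=1):
        super().__init__()
        pad = kernel_size // 2
        self.dw = nn.Conv2d(dim, dim, kernel_size, stride, pad, groups=dim, bias=False)
        self.bn = nn.BatchNorm2d(dim)
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, tokens, hw):              # tokens: (B, N, D)
        B, N, D = tokens.shape
        H, W = hw
        x = tokens.transpose(1, 2).reshape(B, D, H, W)
        x = self.pw(self.bn(self.dw(x)))        # local spatial mixing before attention
        return x.flatten(2).transpose(1, 2)     # back to (B, N', D)


# Hypothetical usage for a first stage on a 224x224 image:
emb = ConvTokenEmbedding(in_ch=3, embed_dim=64)
tokens, hw = emb(torch.randn(1, 3, 224, 224))   # 56*56 = 3136 tokens of dim 64
q = ConvProjection(64, stride=1)(tokens, hw)    # query path keeps all tokens
kv = ConvProjection(64, stride=2)(tokens, hw)   # key/value path downsampled to 28*28
```

Under these assumptions, a stride of 2 on the key/value path reduces the number of tokens attended over by a factor of four, which is one way the convolutional projection cuts the cost of self-attention; and because the strided, overlapping convolutions already encode local spatial structure, no explicit positional encoding needs to be added.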