3 Jun 2021 | Alexey Dosovitskiy*, Lucas Beyer*, Alexander Kolesnikov*, Dirk Weissenborn*, Xiaohua Zhai*, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby*
The paper introduces the Vision Transformer (ViT), a pure transformer-based model for image recognition that achieves excellent performance on image classification when pre-trained on large datasets and then fine-tuned on smaller benchmarks. Unlike traditional convolutional networks (CNNs), ViT splits an image into patches, treats the patches as tokens, and applies self-attention directly to them. This approach matches or outperforms state-of-the-art CNNs in accuracy while requiring substantially fewer computational resources to pre-train. ViT achieves high accuracy on benchmarks such as ImageNet, CIFAR-100, and VTAB, with the best model reaching 88.55% on ImageNet and 94.55% on CIFAR-100.
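Concretely, the patch-to-token step and the attention over those tokens can be sketched as follows. This is a minimal NumPy illustration with random weights; the function names (`patch_tokens`, `self_attention`) and shapes are ours for illustration, not the paper's code:

```python
import numpy as np

# Hypothetical sketch of the ViT front end: split an image into
# non-overlapping patches, project each to an embedding ("token"), then run
# one head of scaled dot-product self-attention over the token sequence.

def patch_tokens(image, patch, dim, rng):
    """(H, W, C) image -> (num_patches, dim) matrix of patch tokens."""
    h, w, c = image.shape
    nh, nw = h // patch, w // patch
    # Rearrange into flattened patches: (nh*nw, patch*patch*c)
    flat = (image.reshape(nh, patch, nw, patch, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(nh * nw, patch * patch * c))
    # A learned linear projection in the real model; random weights here.
    w_embed = rng.standard_normal((flat.shape[1], dim)) / np.sqrt(flat.shape[1])
    return flat @ w_embed

def self_attention(x, rng):
    """Single-head scaled dot-product attention over the token sequence."""
    n, d = x.shape
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)                 # every patch attends to every patch
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))           # toy 32x32 RGB image
tokens = patch_tokens(img, patch=8, dim=64, rng=rng)
out = self_attention(tokens, rng)
print(tokens.shape, out.shape)  # (16, 64) (16, 64): a 4x4 grid of patch tokens
```

In the full model, a learnable classification token is prepended to the patch sequence, position embeddings are added, and the tokens pass through a standard multi-head Transformer encoder rather than this single attention layer.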
ViT performs well when pre-trained on large-scale datasets like ImageNet-21k and JFT-300M, demonstrating that large-scale pre-training can compensate for the transformer's lack of the inductive biases (such as locality and translation equivariance) built into CNNs. Performance improves further as the pre-training dataset grows, showing that the size of the pre-training data significantly impacts the model's ability to generalize.
The paper also explores hybrid models that feed CNN feature maps into ViT, and finds that these can achieve competitive performance, particularly at smaller compute budgets. Additionally, the study investigates self-supervised pre-training for ViT via masked patch prediction, which improves over training from scratch on ImageNet but still lags supervised pre-training.
The results indicate that ViT is a scalable and efficient alternative to CNNs for image recognition, with the potential to be further improved through larger-scale pre-training and more advanced training techniques. The study highlights the importance of large-scale data in achieving state-of-the-art performance and suggests that future research should focus on exploring the scalability of ViT and improving its performance through self-supervised learning.