20 Jun 2022 | Xiaohua Zhai*, Alexander Kolesnikov*, Neil Houlsby, Lucas Beyer*
This paper explores the scaling properties of Vision Transformers (ViT) to understand how to design future models effectively. The authors scale model size from 5 million to 2 billion parameters and training data from 1 million to 3 billion images. They find that scaling up compute, model size, and data together improves representation quality, but the improvement saturates at both ends of the spectrum: even the largest models do not reach zero error, and the smallest models gain little from additional data. Larger models are also more sample-efficient and transfer better in few-shot settings.

Alongside the scaling study, the authors refine the ViT architecture and training recipe, reducing memory consumption and improving accuracy. With these improvements they train a ViT model with 2 billion parameters that achieves a new state-of-the-art 90.45% top-1 accuracy on ImageNet, and 84.86% top-1 accuracy in few-shot transfer with only 10 labeled examples per class. The paper contributes to the understanding of scaling laws for ViT and provides a practical training recipe for large-scale ViT models.
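To make the saturation claim concrete, the paper models the error-versus-compute frontier with a saturating power law of the form error = a · compute^(−b) + c, where the constant c is the irreducible error floor that even the largest models cannot cross. Below is a minimal sketch of fitting such a law; the (compute, error) values are hypothetical stand-ins, not the paper's measurements, and the variable names are my own.

```python
import numpy as np
from scipy.optimize import curve_fit

# Saturating power law: error = a * compute^(-b) + c.
# c > 0 captures the error floor (large models never reach zero error).
def saturating_power_law(compute, a, b, c):
    return a * np.power(compute, -b) + c

# Hypothetical (compute, error) points standing in for measured
# downstream error at increasing model/data/compute scales.
compute = np.array([1e2, 1e3, 1e4, 1e5, 1e6])       # e.g. TPU-core-days
error   = np.array([0.55, 0.38, 0.28, 0.22, 0.19])  # top-1 error

params, _ = curve_fit(saturating_power_law, compute, error, p0=[1.0, 0.3, 0.1])
a, b, c = params
print(f"fit: error ≈ {a:.2f} * compute^(-{b:.2f}) + {c:.2f}")
# The fitted c is the asymptotic error, i.e. the saturation at the large end
# of the spectrum that the summary above refers to.
```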