20 Jun 2022 | Xiaohua Zhai*, Alexander Kolesnikov*, Neil Houlsby, Lucas Beyer*
This paper presents a comprehensive study on the scaling properties of Vision Transformers (ViT) for image recognition tasks. The authors scale ViT models and data both up and down, and characterize the relationships between error rate, data, and compute. They refine the architecture and training of ViT, reducing memory consumption and increasing accuracy. As a result, they successfully train a ViT model with two billion parameters, achieving a new state-of-the-art top-1 accuracy of 90.45% on ImageNet. The model also performs well for few-shot transfer, reaching 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
The study shows that scaling up compute, model, and data together improves representation quality. However, representation quality can be bottlenecked by model size, and large models benefit from additional data even beyond 1B images. The paper also identifies a double-saturating power law between compute and performance: error decreases as a power law in compute, with saturation at both the low and high ends of the compute spectrum.
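To make the shape of that relationship concrete, a double-saturating power law for the error E as a function of compute C can be written, for example, as

    E(C) = a (C + d)^{-b} + c

where a and b set the scale and slope of the power-law regime, c is the error floor approached at very large compute, and d flattens the curve at very low compute. This parameterization is an illustration of the behaviour described above rather than a quote of the paper's exact fit.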
The authors also demonstrate that larger models are more sample-efficient and are strong few-shot learners. They present a new training recipe that enables efficient training of large, high-performing ViT models, and discuss several improvements to the ViT architecture and training procedure, including decoupled weight decay for the "head", saving memory by removing the [class] token, and using memory-efficient optimizers. The study further explores the effects of different learning-rate schedules and model dimensions on performance.
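As a rough illustration of two of these ideas (not the authors' code; the module names and hyperparameter values below are made up), the sketch shows, in PyTorch, pooling patch tokens with a global average instead of a [class] token, and applying a decoupled, stronger weight decay to the classification head than to the body. The paper itself uses a memory-efficient Adafactor variant; AdamW with parameter groups is used here only to illustrate decoupled decay.

    import torch
    import torch.nn as nn

    class TinyViT(nn.Module):
        """Toy stand-in for a ViT body plus classifier, for illustration only."""
        def __init__(self, dim=192, num_classes=1000):
            super().__init__()
            self.body = nn.Linear(dim, dim)          # placeholder for the transformer encoder
            self.head = nn.Linear(dim, num_classes)  # classification "head"

        def forward(self, tokens):                   # tokens: (batch, num_patches, dim)
            x = self.body(tokens)
            # Global average pooling over patch tokens replaces the [class] token readout.
            x = x.mean(dim=1)
            return self.head(x)

    model = TinyViT()

    # Decoupled weight decay: much stronger decay on the head than on the body.
    # The specific values are illustrative, not taken from the paper.
    head_params = list(model.head.parameters())
    body_params = [p for n, p in model.named_parameters() if not n.startswith("head.")]
    optimizer = torch.optim.AdamW(
        [
            {"params": body_params, "weight_decay": 0.03},
            {"params": head_params, "weight_decay": 3.0},
        ],
        lr=1e-3,
    )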
The results show that the performance-compute frontier for ViT models with enough training data roughly follows a (saturating) power law. Crucially, to stay on this frontier, one has to simultaneously scale compute and model size. The paper also highlights the importance of scaling laws in understanding the performance of ViT models and provides insights into the design of future generations of Vision Transformers.
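To illustrate how such a frontier can be characterized in practice, the following sketch fits a saturating power law to hypothetical (compute, error) measurements with SciPy; the data points are invented for the example and do not come from the paper.

    import numpy as np
    from scipy.optimize import curve_fit

    def saturating_power_law(c, a, b, e_inf):
        """Error as a function of compute: a * c**(-b) + e_inf, flattening towards e_inf."""
        return a * np.power(c, -b) + e_inf

    # Hypothetical (compute, error) points -- invented for the example, not from the paper.
    compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0, 300.0])   # e.g. relative compute budget
    error   = np.array([0.40, 0.31, 0.25, 0.21, 0.19, 0.18])   # e.g. few-shot ImageNet error

    (a, b, e_inf), _ = curve_fit(saturating_power_law, compute, error, p0=[0.4, 0.3, 0.1])
    print(f"fitted frontier: error ~= {a:.3f} * compute^(-{b:.3f}) + {e_inf:.3f}")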