24 May 2021 | Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin
This paper explores self-supervised learning for Vision Transformers (ViTs) and its impact on the features these models learn. The authors find that self-supervised ViTs not only perform well on image classification but also exhibit properties that emerge in neither supervised ViTs nor convolutional networks (convnets). In particular, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly in supervised models. These features are also excellent k-NN classifiers, reaching 78.3% top-1 accuracy on ImageNet with a small ViT. The study underlines the importance of the momentum encoder, multi-crop training, and small patches in ViTs.
Based on these findings, the authors propose DINO, a simple self-supervised method that can be interpreted as a form of self-distillation with no labels. DINO achieves 80.1% top-1 accuracy on ImageNet in linear evaluation with ViT-Base, demonstrating the synergy between DINO and ViTs. The paper also notes that DINO works flexibly with both convnets and ViTs and can be trained under limited computational resources.
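The core of the method described above (a student network trained to match the output of a momentum teacher, with the teacher's targets centered and sharpened) can be sketched in a few lines. This is a minimal NumPy illustration of the idea, not the authors' implementation; the function names, temperatures, and momentum value are illustrative assumptions (the paper's actual code is built on PyTorch with additional components such as weight normalization and a temperature schedule).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dino_loss(student_logits, teacher_logits, center,
              tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher targets and student predictions.

    No labels are used: the teacher's (centered, sharpened) output
    distribution is the target, hence "self-distillation without labels".
    """
    # Centering (subtract a running mean) and sharpening (low teacher
    # temperature tau_t) together avoid collapse to a trivial solution.
    t = softmax((teacher_logits - center) / tau_t)
    log_s = np.log(softmax(student_logits / tau_s) + 1e-12)
    return -(t * log_s).sum(axis=-1).mean()

def ema_update(teacher_w, student_w, m=0.996):
    """Momentum-encoder update: the teacher is an exponential moving
    average of the student; gradients never flow through the teacher."""
    return m * teacher_w + (1.0 - m) * student_w
```

In multi-crop training, several augmented views of one image are produced; all crops pass through the student, while only the large "global" crops pass through the teacher, and the loss above is summed over student/teacher view pairs.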