24 May 2021 | Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin
This paper investigates the properties of self-supervised learning in Vision Transformers (ViT) and shows that self-supervised ViT features differ markedly from those of supervised ViTs and convolutional networks (convnets): they contain explicit information about the semantic segmentation of an image and perform well as k-NN classifiers, reaching 78.3% top-1 accuracy on ImageNet with a small ViT. The study also underlines the importance of the momentum encoder, multi-crop training, and small patches for feature quality. The authors introduce DINO, a self-supervised method that can be interpreted as a form of self-distillation with no labels; with ViT-Base, DINO reaches 80.1% top-1 accuracy on ImageNet under linear evaluation.
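The self-distillation idea can be made concrete with a toy sketch: a student network is trained to match sharpened, centered outputs of a momentum (EMA) teacher via cross-entropy. This is a minimal NumPy illustration of that training step, not the paper's implementation; the dimensions, variable names (`student_W`, `teacher_W`, `center`), and hyperparameter values are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, out_dim = 16, 8
student_W = rng.normal(size=(dim, out_dim))   # toy "student" projection
teacher_W = student_W.copy()                  # teacher starts as a copy of the student
center = np.zeros(out_dim)                    # running center for teacher outputs

def softmax(x, temp):
    z = (x - x.max(axis=-1, keepdims=True)) / temp
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(x_student_view, x_teacher_view, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between sharpened teacher targets and student outputs."""
    s = x_student_view @ student_W
    t = x_teacher_view @ teacher_W
    p_t = softmax(t - center, tau_t)          # center + sharpen the teacher targets
    log_p_s = np.log(softmax(s, tau_s) + 1e-12)
    return -(p_t * log_p_s).sum(axis=-1).mean(), t

# one "training" step on a toy batch: two augmented views of 4 samples
views = rng.normal(size=(2, 4, dim))
loss, t_out = dino_loss(views[0], views[1])

# momentum (EMA) update of the teacher, plus the center update
m, c_m = 0.996, 0.9
teacher_W = m * teacher_W + (1 - m) * student_W
center = c_m * center + (1 - c_m) * t_out.mean(axis=0)
```

Centering and sharpening pull in opposite directions, which is how DINO avoids the collapsed solutions that plague self-distillation without negatives; the EMA teacher plays the role of the momentum encoder highlighted above.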
The study shows that DINO works well with both convnets and ViTs without architecture-specific adjustments. DINO is flexible and applies to a range of tasks, including image retrieval, object discovery, and transfer learning: the learned features support nearest-neighbor search, retain information about object location, and transfer well to downstream tasks. DINO features also compare favorably with other self-supervised methods in both accuracy and efficiency, achieving high accuracy with small patch sizes and modest compute, which makes it a promising approach for self-supervised learning in vision.
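The weighted k-NN evaluation behind results like the 78.3% figure can be sketched as follows: frozen features are L2-normalized, test samples are compared to the training set by cosine similarity, and each of the top-k neighbors casts a vote weighted by its similarity. The sketch below uses synthetic Gaussian "features" and illustrative parameters; it shows the evaluation protocol, not the paper's actual features or code.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, dim, n_classes, k = 200, 20, 32, 5, 10

# synthetic stand-ins for frozen backbone features: class mean + noise
means = rng.normal(size=(n_classes, dim))
train_y = rng.integers(0, n_classes, n_train)
train_x = means[train_y] + 0.3 * rng.normal(size=(n_train, dim))
test_y = rng.integers(0, n_classes, n_test)
test_x = means[test_y] + 0.3 * rng.normal(size=(n_test, dim))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def knn_predict(train_x, train_y, test_x, k=10, temp=0.07):
    sims = l2_normalize(test_x) @ l2_normalize(train_x).T  # cosine similarities
    idx = np.argsort(-sims, axis=1)[:, :k]                 # top-k neighbors
    votes = np.zeros((len(test_x), n_classes))
    for i in range(len(test_x)):
        w = np.exp(sims[i, idx[i]] / temp)                 # similarity-weighted votes
        np.add.at(votes[i], train_y[idx[i]], w)
    return votes.argmax(axis=1)

pred = knn_predict(train_x, train_y, test_x, k=k)
acc = (pred == test_y).mean()
print(f"toy k-NN accuracy: {acc:.2f}")
```

The appeal of this protocol is that it involves no training at all on top of the frozen features, so a high k-NN score directly reflects how linearly separable and well-clustered the representation already is.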