01/2024 | Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski
DINOv2: Learning Robust Visual Features without Supervision
This paper presents DINOv2, a self-supervised method for learning robust visual features without supervision. The authors show that existing pretraining methods, especially self-supervised ones, can produce general-purpose visual features if trained on enough curated data from diverse sources. They revisit existing approaches and combine different techniques to scale their pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, they propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, they train a ViT model with 1B parameters and distill it into a series of smaller models that surpass the best available general-purpose features, OpenCLIP, on most of the benchmarks at image and pixel levels.
The authors build an automatic pipeline to filter and rebalance datasets from an extensive collection of uncurated images. The pipeline is inspired by those used in NLP, where data similarities are used instead of external metadata, removing the need for manual annotation. A major difficulty when dealing with images in the wild is rebalancing concepts and avoiding overfitting on a few dominant modes; in this work, a naive clustering approach works reasonably well to resolve the issue. They gather a small but diverse curated corpus of 142M images to validate their approach.
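To make the similarity-based curation step concrete, the sketch below shows one possible way to deduplicate uncurated embeddings and retrieve neighbors of a curated seed set. It is illustrative only, not the authors' released pipeline: the similarity thresholds are made up, and faiss is used here merely as one convenient nearest-neighbor backend.

```python
# Illustrative sketch of similarity-based curation (NOT the paper's pipeline):
# 1) greedily drop near-duplicates, 2) retrieve uncurated images close to a
# curated seed set. Embeddings are assumed to be float32, C-contiguous arrays.
import numpy as np
import faiss  # any k-NN library would do


def deduplicate(embeddings: np.ndarray, threshold: float = 0.96) -> np.ndarray:
    """Return indices of rows kept after greedy cosine-similarity dedup."""
    faiss.normalize_L2(embeddings)                 # in-place L2 normalization
    index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine
    keep = []
    for i, vec in enumerate(embeddings):
        if index.ntotal > 0:
            sim, _ = index.search(vec[None, :], 1)
            if sim[0, 0] >= threshold:
                continue                            # too close to a kept image
        index.add(vec[None, :])
        keep.append(i)
    return np.array(keep)


def retrieve(curated: np.ndarray, uncurated: np.ndarray, k: int = 4) -> np.ndarray:
    """For each curated image, pull its k nearest uncurated neighbors."""
    faiss.normalize_L2(curated)
    faiss.normalize_L2(uncurated)
    index = faiss.IndexFlatIP(uncurated.shape[1])
    index.add(uncurated)
    _, nn = index.search(curated, k)
    return np.unique(nn.ravel())                    # deduplicated candidate pool
```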
They provide a variety of pretrained visual models, called DINOv2, trained with different Vision Transformer (ViT) architectures on their data. They release all the models and the code to retrain DINOv2 on any data. They validate the quality of DINOv2 on various computer vision benchmarks at both image and pixel levels as model size is scaled, as summarized in Figure 2. They conclude that self-supervised pretraining alone is a good candidate for learning transferable frozen features that are competitive with the best openly available weakly-supervised models.
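As a usage note, the released checkpoints are exposed through torch.hub in the public facebookresearch/dinov2 repository. A minimal example of extracting frozen features with the distilled ViT-S/14 backbone (the placeholder batch below stands in for properly normalized images):

```python
import torch

# Load one of the released backbones via torch.hub (downloads on first call).
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

# Images should be resized to a multiple of the 14-pixel patch size
# (e.g. 224x224) and ImageNet-normalized; random noise is used here only
# as a placeholder batch.
images = torch.randn(2, 3, 224, 224)
with torch.no_grad():
    features = model(images)   # global (CLS) embeddings, 384-dim for ViT-S/14
print(features.shape)
```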
The authors also discuss related work, including intra-image self-supervised training, discriminative self-supervised learning, scaling self-supervised pretraining, and automatic data curation. They describe their data processing pipeline, which includes curated and uncurated data sources, image deduplication, and self-supervised image retrieval. They also describe their discriminative self-supervised pretraining method, which combines the DINO and iBOT losses with the Sinkhorn-Knopp centering of SwAV. They add a KoLeo regularizer to spread features and finish with a short high-resolution training phase.
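The "regularizer to spread features" is the KoLeo (differential entropy) loss, which pushes each feature in a batch away from its nearest neighbor. A minimal PyTorch sketch of the idea, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F


def koleo_loss(features: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KoLeo regularizer sketch: maximize the log distance from each
    L2-normalized feature to its nearest neighbor within the batch."""
    x = F.normalize(features, dim=-1)                       # unit-norm features
    n = x.shape[0]
    sims = x @ x.t()                                        # pairwise cosines
    sims = sims.masked_fill(                                # ignore self-matches
        torch.eye(n, dtype=torch.bool, device=x.device), -1.0)
    nn_sim = sims.max(dim=1).values                         # closest other sample
    # On the unit sphere, ||a - b||^2 = 2 - 2<a, b>.
    nn_dist = torch.sqrt(torch.clamp(2.0 - 2.0 * nn_sim, min=0.0))
    return -torch.log(nn_dist + eps).mean()
```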
The authors also discuss their efficient implementation, including fast and memory-efficient attention, sequence packing, efficient stochastic depth, and fully-sharded data parallel (FSDP) training. They also discuss model distillation: rather than training the smaller models from scratch, they distill them from the largest ViT model, which yields better small models. They present ablation studies showing the importance of the various components of their pipeline, including the technical modifications to the training recipe and the curated pretraining data.
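For distillation, the paper reuses its self-supervised objective with the frozen large model standing in for the usual exponential-moving-average teacher. The sketch below shows only the core idea, soft-target cross-entropy between a frozen teacher and a smaller student; it is a simplified, hypothetical step, not the released training loop, and the temperature values are placeholders.

```python
import torch
import torch.nn.functional as F


def distillation_step(student, teacher, images,
                      student_temp: float = 0.1, teacher_temp: float = 0.07):
    """Simplified distillation step: the frozen large teacher provides soft
    targets over prototypes and the small student matches them with a
    cross-entropy loss. Sketch only; the paper reuses its full objective."""
    with torch.no_grad():
        t_logits = teacher(images)                           # frozen targets
        t_probs = F.softmax(t_logits / teacher_temp, dim=-1)
    s_logits = student(images)
    s_logprobs = F.log_softmax(s_logits / student_temp, dim=-1)
    loss = -(t_probs * s_logprobs).sum(dim=-1).mean()
    return loss
```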