28 Jun 2024 | Huy V. Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin, Hervé Jégou, Patrick Labatut, Piotr Bojanowski
This paper addresses the challenge of automatically curating high-quality datasets for self-supervised learning (SSL). The authors propose a clustering-based approach to build large, diverse, and balanced datasets, which are essential for effective SSL. The method applies hierarchical $k$-means to a large data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on web-based images, satellite images, and text demonstrate that features trained on automatically curated datasets outperform those trained on uncurated or manually curated data.
The paper also discusses the limitations of vanilla $k$-means and introduces a modified version, hierarchical $k$-means, which addresses them by forming larger clusters for dominant concepts and smaller clusters for less frequent ones, yielding more balanced and uniform distributions. The proposed method is evaluated on various benchmarks, showing significant improvements in accuracy, robustness, and generalization.
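To make the two-stage recipe concrete, here is a minimal NumPy sketch of the idea: cluster the points with $k$-means, cluster the resulting centroids again to form a second level, then draw a sample whose budget is split uniformly across top-level clusters and, within each, across their sub-clusters. This is an illustrative toy with a plain Lloyd's $k$-means and a fixed two-level hierarchy; the paper's actual pipeline runs a resampling-based $k$-means variant on SSL embeddings at web scale, and the function and parameter names below are invented for this example.

```python
import numpy as np

def kmeans(x, k, iters=25, seed=0):
    """Plain Lloyd's k-means (illustrative only; the paper uses a
    modified variant). Returns (centroids, point-to-cluster assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # squared Euclidean distance from every point to every centroid
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep old centroid if cluster emptied
                centroids[j] = members.mean(axis=0)
    return centroids, assign

def hierarchical_kmeans(x, ks):
    """Apply k-means successively: level 1 clusters the points,
    level 2 clusters the level-1 centroids, and so on."""
    levels, data = [], x
    for k in ks:
        centroids, assign = kmeans(data, k)
        levels.append(assign)
        data = centroids
    return levels

def balanced_sample(levels, budget, seed=0):
    """Two-level balanced sampling: split the budget uniformly over
    top-level clusters, then uniformly over each one's sub-clusters."""
    rng = np.random.default_rng(seed)
    leaf_of_point, top_of_leaf = levels  # assumes exactly two levels
    tops = np.unique(top_of_leaf)
    picked = []
    for top in tops:
        leaves = np.flatnonzero(top_of_leaf == top)
        per_leaf = max(1, budget // (len(tops) * len(leaves)))
        for leaf in leaves:
            pts = np.flatnonzero(leaf_of_point == leaf)
            take = min(per_leaf, len(pts))
            picked.extend(rng.choice(pts, size=take, replace=False))
    return np.array(picked)
```

Because the per-cluster budget is fixed rather than proportional to cluster size, a rare concept contributes roughly as many samples as a dominant one, which is the balancing effect the summary describes.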