Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach


28 Jun 2024 | Huy V. Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin, Hervé Jégou, Patrick Labatut, Piotr Bojanowski
Self-supervised learning (SSL) is central to modern machine learning systems. SSL features are typically pre-trained on large data collections, but manually curating such collections is costly and time-consuming. This work proposes a clustering-based method to automatically curate large, diverse, and balanced datasets for SSL pre-training. The method applies k-means successively and hierarchically to a large uncurated data repository to obtain clusters that distribute uniformly among data concepts, then draws a hierarchical, balanced sample from these clusters. Because raw web-scale data follows a long-tailed distribution, this rebalancing step is essential for avoiding biases in SSL pre-training.

The approach is generic and agnostic to downstream tasks, allowing useful properties to be inferred directly from uncurated data. Hierarchical k-means is shown to produce more balanced clusterings than standard k-means, effectively flattening the data distribution.

Experiments cover three domains: web-based images, text, and satellite images. For images, features are trained with DINOv2, a distillation-based SSL approach, and evaluated on ImageNet classification, out-of-distribution testing, long-tailed benchmarks, retrieval, fine-grained classification, and dense prediction. Across all three domains, features trained on automatically curated datasets outperform those trained on uncurated data and are on par with or better than those trained on manually curated data, with improved robustness and out-of-distribution generalization.
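The core recipe — hierarchical k-means followed by balanced sampling — can be illustrated with a minimal two-level sketch. This is an illustration, not the paper's implementation: the function names (`kmeans`, `hierarchical_curate`) and the round-robin sampling scheme are assumptions made for clarity, and a plain Lloyd's k-means stands in for whatever large-scale clustering the authors use. Level 1 clusters the raw points, level 2 clusters the level-1 centroids, and sampling then gives each top-level cluster an equal budget, spread round-robin across its sub-clusters so that rare concepts are not drowned out by dominant ones.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; leave empty clusters untouched.
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centroids[j] = pts.mean(axis=0)
    return centroids, labels

def hierarchical_curate(X, k1, k2, per_top_cluster, seed=0):
    """Two-level hierarchical k-means plus balanced sampling (a sketch).

    Level 1 clusters the data points; level 2 clusters the level-1
    centroids. Each top-level cluster then gets an equal sampling
    budget, drawn round-robin across its sub-clusters. Returns the
    indices of the selected points.
    """
    rng = np.random.default_rng(seed)
    c1, lab1 = kmeans(X, k1, seed=seed)   # fine clusters on data points
    _, lab2 = kmeans(c1, k2, seed=seed)   # coarse clusters on centroids
    selected = []
    for top in range(k2):
        subs = np.flatnonzero(lab2 == top)  # sub-clusters under this top cluster
        pools = [list(rng.permutation(np.flatnonzero(lab1 == s))) for s in subs]
        budget = per_top_cluster
        while budget > 0 and any(pools):
            for pool in pools:              # round-robin over sub-clusters
                if pool and budget > 0:
                    selected.append(int(pool.pop()))
                    budget -= 1
    return np.array(selected)
```

With a long-tailed input (one dense mode, one sparse mode), the equal per-top-cluster budget means the sparse mode contributes far more than its raw frequency would suggest, which is the flattening effect the paper attributes to hierarchical k-means over standard k-means.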