12 Mar 2024 | Amro Abbas, Evgenia Rusak, Kushal Tirumala, Wieland Brendel, Kamalika Chaudhuri, Ari S. Morcos
This paper presents an effective pruning method for web-scale datasets to improve the training efficiency and performance of CLIP-style models. The authors scale Self-Supervised-Prototypes Pruning (SSP-Pruning) to large, noisy multimodal datasets such as LAION and propose Density-Based Pruning (DBP), which adapts the pruning rate to the complexity of the concepts within the dataset. DBP clusters the embeddings, estimates each cluster's complexity from its intra-cluster and inter-cluster distances, and uses that estimate to decide how many representative samples to keep per cluster. Training on the pruned data costs roughly a quarter of regular training while maintaining or improving performance. The authors demonstrate that their approach outperforms existing methods on ImageNet zero-shot accuracy and achieves state-of-the-art results on the DataComp Medium benchmark, showcasing the significant impact of optimized dataset pruning on machine learning models.
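To make the mechanics concrete, here is a minimal sketch of the density-based selection step in Python, assuming precomputed CLIP image embeddings. The function name `density_based_pruning`, the product form of the complexity score, and hyperparameters such as `tau` and `k` are illustrative assumptions for this sketch, not the paper's exact recipe.

```python
# Minimal DBP sketch, assuming precomputed CLIP embeddings of shape (N, D).
# The complexity formula and hyperparameters below are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

def density_based_pruning(embeddings, k=100, keep_fraction=0.25, tau=0.1):
    """Select a subset of samples, keeping more from clusters that look
    more 'complex' (spread out internally and far from their neighbors)."""
    # 1. Cluster the L2-normalized embeddings into k concept clusters.
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    centroids, labels = km.cluster_centers_, km.labels_

    # 2. Intra-cluster distance: mean distance of members to their centroid.
    d_intra = np.array([
        np.linalg.norm(embeddings[labels == j] - centroids[j], axis=1).mean()
        for j in range(k)
    ])

    # 3. Inter-cluster distance: distance to the nearest other centroid.
    cdist = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    np.fill_diagonal(cdist, np.inf)
    d_inter = cdist.min(axis=1)

    # 4. Complexity score per cluster; a softmax turns scores into each
    #    cluster's share of the global sample budget.
    complexity = d_intra * d_inter
    weights = np.exp(complexity / tau)
    weights /= weights.sum()
    budget = np.round(weights * keep_fraction * len(embeddings)).astype(int)

    # 5. Within each cluster, keep the samples closest to the prototype
    #    ("easy" examples, which SSP-Pruning favors on noisy web data).
    kept = []
    for j in range(k):
        members = np.where(labels == j)[0]
        dists = np.linalg.norm(embeddings[members] - centroids[j], axis=1)
        n_keep = min(budget[j], len(members))
        kept.extend(members[np.argsort(dists)[:n_keep]])
    return np.array(kept)
```

The key design choice this sketch captures is that the pruning rate is not uniform: clusters judged more complex receive a larger share of the retention budget, while simple, dense clusters are pruned more aggressively.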