EFFECTIVE PRUNING OF WEB-SCALE DATASETS BASED ON COMPLEXITY OF CONCEPT CLUSTERS

2024 | Amro Abbas, Evgenia Rusak, Kushal Tirumala, Wieland Brendel, Kamalika Chaudhuri, Ari S. Morcos
This paper presents a method for effectively pruning web-scale datasets used to train CLIP-style models, reducing training costs while maintaining or improving performance. The authors propose Density-Based Pruning (DBP), which extends Self-Supervised-Prototypes Pruning (SSP-Pruning) to web-scale data. DBP uses a complexity measure based on inter- and intra-cluster distances to decide how many samples to keep from each concept cluster, yielding a more diverse and less redundant training set. Applied to the LAION dataset, the method reduces training cost to 27.7% of regular training while achieving better performance than training on the full dataset. The approach also transfers to the DataComp benchmark, where it outperforms existing methods on multiple tasks. The results show that training on a smaller, high-quality dataset can yield better performance at significantly lower training cost. The method is evaluated on ImageNet zero-shot accuracy, ImageNet distribution shifts, retrieval, and VTAB, demonstrating improved model performance with reduced computational requirements. The paper also reviews related work in data curation and contrastive image-language pre-training, highlighting the importance of efficient data selection for training large-scale models.
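To make the idea of complexity-driven quotas concrete, below is a minimal sketch of how per-cluster keep budgets might be derived from inter- and intra-cluster distances. It is not the authors' implementation: the combination of the two distances, the softmax temperature, and the within-cluster keep criterion (nearest to the centroid) are illustrative assumptions, and pre-computed embeddings and k-means assignments are taken as given.

```python
# Illustrative sketch of density-based quota allocation, not the paper's exact code.
# Assumes: `embeddings` (N x D array), `labels` (cluster id per sample),
# `centroids` (k x D array) from a prior k-means step over image embeddings.
import numpy as np

def dbp_quotas(embeddings, labels, centroids, total_budget, temperature=1.0):
    """Allocate a per-cluster keep quota from a cluster-complexity score.

    Complexity here combines intra-cluster spread (mean distance of a cluster's
    points to its centroid) with inter-cluster distance (distance to the nearest
    other centroid): spread-out, isolated clusters count as more "complex" and
    receive a larger share of the sample budget. The product and the softmax
    weighting are illustrative choices.
    """
    k = len(centroids)
    complexities = np.zeros(k)
    for j in range(k):
        members = embeddings[labels == j]
        intra = np.linalg.norm(members - centroids[j], axis=1).mean()
        other = np.delete(centroids, j, axis=0)
        inter = np.linalg.norm(other - centroids[j], axis=1).min()
        complexities[j] = intra * inter

    # Softmax over complexities (shifted for numerical stability) gives each
    # cluster's share of the total sample budget.
    weights = np.exp((complexities - complexities.max()) / temperature)
    weights /= weights.sum()
    return np.round(weights * total_budget).astype(int)

def prune(embeddings, labels, centroids, total_budget):
    """Keep, per cluster, its quota of samples closest to the centroid
    (an illustrative within-cluster criterion, not necessarily the paper's)."""
    quotas = dbp_quotas(embeddings, labels, centroids, total_budget)
    keep = []
    for j, q in enumerate(quotas):
        idx = np.where(labels == j)[0]
        dists = np.linalg.norm(embeddings[idx] - centroids[j], axis=1)
        keep.extend(idx[np.argsort(dists)[:q]].tolist())
    return keep
```

The key point this sketch illustrates is that the budget is not split uniformly: clusters judged more complex contribute more retained samples, which is what lets the pruned dataset stay diverse while redundant, densely packed clusters are cut more aggressively.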