No “Zero-Shot” Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

8 Apr 2024 | Vishaal Udandarao, Ameya Prabhu, Adhiraj Ghosh, Yash Sharma, Philip H.S. Torr, Adel Bibi, Samuel Albanie, Matthias Bethge
The paper investigates the relationship between the frequency of concepts in pretraining datasets and the performance of multimodal models on downstream tasks, challenging the notion of "zero-shot" generalization. The authors analyze 34 models across five standard pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics) and find that downstream performance improves only linearly as concept frequency increases exponentially, i.e., a log-linear scaling trend. The trend holds even when controlling for sample-level similarity between pretraining and test data, and when testing on purely synthetic data distributions. The study also reveals that pretraining datasets exhibit a long-tailed distribution of concept frequencies, with a significant degree of misalignment between the concepts appearing in images and those in their paired captions. To address these issues, the authors introduce the "Let It Wag!" benchmark, which tests models on long-tailed concepts; current models perform poorly on it, highlighting the need for better strategies for handling long-tailed distributions. The findings suggest that the key to achieving "zero-shot" generalization under large-scale training paradigms remains to be discovered.
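The log-linear trend described above can be made concrete with a small sketch: fitting accuracy against the logarithm of concept frequency with a least-squares line. The data points below are hypothetical, chosen only to illustrate what such a trend looks like; they are not taken from the paper.

```python
import numpy as np

def fit_log_linear(frequencies, accuracies):
    """Fit accuracy = a * log10(frequency) + b and return (a, b).

    A log-linear trend means accuracy improves by a roughly constant
    amount each time concept frequency is multiplied by a fixed factor
    (here, each 10x increase in frequency).
    """
    log_freq = np.log10(np.asarray(frequencies, dtype=float))
    a, b = np.polyfit(log_freq, np.asarray(accuracies, dtype=float), deg=1)
    return a, b

# Hypothetical (concept frequency, zero-shot accuracy) pairs following
# the paper's qualitative finding: linear accuracy gains require
# exponentially more pretraining examples of the concept.
freqs = [1e2, 1e3, 1e4, 1e5, 1e6]
accs = [0.12, 0.22, 0.31, 0.42, 0.51]

slope, intercept = fit_log_linear(freqs, accs)
# slope ≈ 0.098: each 10x increase in frequency adds ~10 accuracy points
```

Under this model, reaching a given accuracy target for a rare concept requires exponentially more pretraining data than for a common one, which is the paper's core argument against calling such evaluation "zero-shot."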