No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

8 Apr 2024 | Vishaal Udandarao, Ameya Prabhu, Adhiraj Ghosh, Yash Sharma, Philip H.S. Torr, Adel Bibi, Samuel Albanie, Matthias Bethge
The paper investigates how the frequency of concepts in pretraining datasets determines the performance of multimodal models on downstream tasks. It finds that these models do not exhibit true "zero-shot" generalization: downstream performance on a concept follows a log-linear scaling trend with that concept's pretraining frequency, so exponentially more pretraining data is needed for each linear improvement in performance. The trend is robust across pretraining and downstream datasets and holds even on controlled synthetic data. The study also shows that pretraining datasets have long-tailed concept distributions, with many concepts appearing only rarely, and that there is substantial misalignment between the image and text modalities in pretraining data, where a concept often appears in one modality but not the other. The authors introduce the "Let It Wag!" benchmark to test models on long-tailed concepts and find that all tested models underperform on it. These results suggest that current multimodal models are not truly capable of "zero-shot" generalization and highlight the need for more sample-efficient learning strategies. The study deepens our understanding of how pretraining data shapes model performance and underscores the importance of addressing long-tailed concept distributions in pretraining data.
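To make the log-linear claim concrete, here is a minimal sketch (not the authors' code) of the kind of fit it describes: downstream accuracy is regressed on log10 of concept frequency using NumPy. The frequency and accuracy values below are invented placeholders for illustration, not measurements from the paper.

```python
import numpy as np

# Hypothetical illustration of a log-linear scaling trend:
# accuracy ≈ slope * log10(concept_frequency) + intercept.
# The numbers are made up for demonstration; they are NOT the paper's data.
concept_frequency = np.array([1e2, 1e3, 1e4, 1e5, 1e6])        # occurrences in pretraining data
zero_shot_accuracy = np.array([0.12, 0.25, 0.38, 0.52, 0.64])  # placeholder accuracies

# Ordinary least-squares fit of accuracy against log10(frequency).
slope, intercept = np.polyfit(np.log10(concept_frequency), zero_shot_accuracy, deg=1)
print(f"accuracy ≈ {slope:.3f} * log10(frequency) + {intercept:.3f}")

# Under such a fit, every 10x increase in concept frequency yields roughly a
# constant gain in accuracy: exponential data for linear improvement.
```

Under this kind of fit, a roughly straight line on a log-frequency axis is exactly the "exponential data for linear performance" behavior the paper reports.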