22 May 2024 | Shubham Parashar, Zhiqiu Lin, Tian Liu, Xiangjue Dong, Yanan Li, Deva Ramanan, James Caverlee, Shu Kong
Vision-language models (VLMs) perform unevenly across visual concepts because those concepts follow a long-tailed distribution in the pretraining data, and the models struggle with concepts that appear rarely. This paper introduces Retrieval-Augmented Learning (REAL) to address this issue. REAL improves zero-shot recognition in two ways: by using frequent synonyms of visual concepts in prompts, and by retrieving balanced pretraining data to train a linear classifier. REAL-Prompt replaces each original class name with its most frequent synonym and outperforms existing prompting methods. REAL-Linear retrieves a balanced set of pretraining data and trains a robust linear classifier on it, achieving state-of-the-art performance with significantly fewer computational resources. REAL improves performance across various benchmarks, and the same approach also improves image generation for rare concepts with text-to-image models. The paper highlights the importance of addressing concept frequency imbalances in VLMs to improve downstream applications.
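
The two ideas summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the prompt template, the per-class cap used to enforce balance, and the toy counts are all assumptions made for illustration; the summary does not say how synonym frequencies are measured or how the retrieved data is balanced.

```python
from collections import defaultdict


def most_frequent_synonym(synonyms, caption_counts):
    """Return the synonym with the highest count (ties go to the earliest listed)."""
    return max(synonyms, key=lambda name: caption_counts.get(name, 0))


def build_prompts(class_synonyms, caption_counts, template="a photo of a {}"):
    """REAL-Prompt idea: prompt with the most frequent synonym of each class name.

    The template is an assumed, commonly used zero-shot prompt format.
    """
    return {
        cls: template.format(most_frequent_synonym(names, caption_counts))
        for cls, names in class_synonyms.items()
    }


def balanced_subset(retrieved, per_class):
    """REAL-Linear idea (data side only): keep at most `per_class` items per class.

    `retrieved` is an iterable of (class_label, item) pairs in retrieval order.
    Capping each class is one assumed way to balance the retrieved data.
    """
    kept = defaultdict(list)
    for label, item in retrieved:
        if len(kept[label]) < per_class:
            kept[label].append(item)
    return dict(kept)


# Toy inputs: the counts below are invented for illustration, not measured.
class_synonyms = {"cash machine": ["cash machine", "ATM"]}
caption_counts = {"cash machine": 120, "ATM": 9500}
prompts = build_prompts(class_synonyms, caption_counts)

retrieved = [("a", 1), ("a", 2), ("a", 3), ("b", 4)]
subset = balanced_subset(retrieved, per_class=2)
```

In this sketch the balanced subset would then be encoded by the VLM and used to fit a linear classifier on the resulting features. That training step is omitted because the summary gives no details about it.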