22 May 2024 | Shubham Parashar*1, Zhiqiu Lin*2, Tian Liu*1, Xiangjue Dong1, Yanan Li3, Deva Ramanan2, James Caverlee1, Shu Kong1,4†
The paper "The Neglected Tails in Vision-Language Models" addresses the issue of imbalanced performance in Vision-Language Models (VLMs) across different visual concepts. VLMs, while excelling in zero-shot recognition, often perform poorly on rare concepts due to their limited presence in pretraining datasets. The authors propose a method to estimate the frequency of visual concepts in VLMs' pretraining data using large language models (LLMs). This method reveals that popular datasets like LAION exhibit a long-tailed distribution of concepts, leading to biased performance in VLMs. The paper also finds that downstream applications of VLMs, such as visual chatbots and text-to-image models, often fail to recognize or generate images of rare concepts.
To mitigate this imbalance, the authors introduce REtrieval-Augmented Learning (REAL) in two variants: REAL-Prompt and REAL-Linear. REAL-Prompt replaces each original class name with its most frequent synonym in the pretraining texts, improving zero-shot recognition accuracy. REAL-Linear retrieves a small, class-balanced subset of pretraining data using concept synonyms and trains a robust linear classifier on it, outperforming prior retrieval-augmented methods with significantly less storage and training time.
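To make REAL-Prompt concrete, here is a minimal zero-shot sketch using the open_clip library. The `frequent_synonym` mapping, checkpoint tag, and `classify` helper are illustrative assumptions rather than the paper's exact code; the "cash machine" → "ATM" pair is an example of the kind of substitution REAL-Prompt performs:

```python
import torch
import open_clip

# Illustrative class -> most-frequent-synonym mapping; in REAL-Prompt this
# comes from the synonym-frequency counts over pretraining captions.
frequent_synonym = {"cash machine": "ATM", "tiger shark": "tiger shark"}
classes = list(frequent_synonym)

# Model and pretrained tags are assumptions; any open_clip checkpoint works.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Build prompts from the frequent synonyms instead of the original names.
prompts = [f"a photo of a {frequent_synonym[c]}" for c in classes]
with torch.no_grad():
    text_feats = model.encode_text(tokenizer(prompts))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def classify(image_tensor):
    """Zero-shot prediction: nearest text embedding by cosine similarity.
    `image_tensor` is a single preprocessed image, e.g.
    preprocess(Image.open("photo.jpg"))."""
    with torch.no_grad():
        img_feat = model.encode_image(image_tensor.unsqueeze(0))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    return classes[(img_feat @ text_feats.T).argmax().item()]
```

The only change from standard CLIP zero-shot classification is the synonym substitution in the prompt, which is what makes the approach essentially free at inference time.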
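REAL-Linear's second stage is a linear probe. Below is a sketch under stated assumptions: `feats` and `labels` are placeholder names for CLIP image features and class indices of the retrieved, class-balanced subset (the retrieval itself reuses the synonym matching shown earlier), and the training recipe here is a generic one, not necessarily the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def train_linear_probe(feats, labels, num_classes, epochs=100, lr=1e-3):
    """Fit a linear classifier on normalized CLIP features.

    feats:  [N, D] float tensor of image features from the retrieved,
            class-balanced subset (placeholder input).
    labels: [N] long tensor of class indices (placeholder input).
    """
    feats = F.normalize(feats, dim=-1)
    clf = torch.nn.Linear(feats.shape[1], num_classes)
    opt = torch.optim.AdamW(clf.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(clf(feats), labels)
        loss.backward()
        opt.step()
    return clf
```

Because only the CLIP features of a small balanced subset need to be stored and a single linear layer trained, this is consistent with the storage and training-time savings claimed above.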
The paper provides experimental results demonstrating the effectiveness of REAL on various benchmarks, showing improvements on both head and tail classes. It also discusses the broader societal impact of addressing this imbalance and acknowledges limitations, such as the lack of ground-truth concept-frequency annotations and the time-consuming nature of filtering ambiguous captions.