Modeling Caption Diversity in Contrastive Vision-Language Pretraining

April 1, 2025 | Samuel Lavoie, Polina Kirichenko, Mark Ibrahim, Mahmoud Assran, Andrew Gordon Wilson, Aaron Courville, Nicolas Ballas
This paper introduces Llip (Latent Language Image Pretraining), a contrastive vision-language pretraining method that models the diversity of captions that could describe a given image. Whereas CLIP maps an image and its caption each to a single vector, Llip's visual encoder outputs a set of K visual mixture components, and a cross-attention module weights those components according to the text representation to produce the final, caption-contextualized visual feature. By conditioning on the caption, Llip captures more of the richness of the visual input: the same image can be represented differently for each of its many valid descriptions, rather than being collapsed into one embedding.

On zero-shot classification and retrieval benchmarks, Llip outperforms non-contextualized baselines such as CLIP and SigLIP. With a ViT-G/14 encoder, it improves zero-shot classification by 2.9% on average and reaches 83.5% top-1 accuracy on ImageNet, 1.4% above a similarly sized CLIP model. It also improves zero-shot retrieval on MS-COCO by 6.0%. A comprehensive analysis of the components introduced by the method shows that contextualizing the visual features with the target caption drives these gains, yielding richer visual representations and consistently stronger downstream performance across benchmarks.
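To make the mixing step concrete, here is a minimal PyTorch sketch of text-conditioned mixing over K visual mixture components. This is an illustrative reconstruction, not the authors' implementation: the class name, projection layout, and dimensions are assumptions, and only the overall scheme (cross-attention from the text feature over K visual components, followed by a weighted sum) follows the paper's description.

```python
# Illustrative sketch of Llip-style text-conditioned mixing of K visual
# mixture components. Names and shapes are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedMixer(nn.Module):
    """Mixes K visual mixture components into a single visual feature,
    with mixture weights produced by cross-attention from the text feature."""

    def __init__(self, dim: int, num_components: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # query from the pooled text feature
        self.k_proj = nn.Linear(dim, dim)  # keys from the K visual components
        self.scale = dim ** -0.5
        self.num_components = num_components

    def forward(self, visual_components: torch.Tensor,
                text_feat: torch.Tensor) -> torch.Tensor:
        # visual_components: (batch, K, dim) -- K mixture tokens from the image encoder
        # text_feat:         (batch, dim)    -- pooled caption representation
        q = self.q_proj(text_feat).unsqueeze(1)           # (batch, 1, dim)
        k = self.k_proj(visual_components)                # (batch, K, dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale     # (batch, 1, K)
        weights = attn.softmax(dim=-1)                    # mixture weights over K components
        mixed = (weights @ visual_components).squeeze(1)  # (batch, dim)
        return F.normalize(mixed, dim=-1)                 # unit norm for the contrastive loss
```

Because a different caption produces different mixture weights, the same image yields a different contextualized feature for each candidate description; the resulting feature is then compared against text embeddings with a standard CLIP- or SigLIP-style contrastive objective.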