29 Mar 2025 | Samuel Lavoie, Polina Kirichenko, Mark Ibrahim, Mahmoud Assran, Andrew Gordon Wilson, Aaron Courville, Nicolas Ballas
This paper introduces Llip (Latent Language Image Pretraining), a method that addresses a key limitation of CLIP (Contrastive Language-Image Pretraining) by modeling the diversity of captions that can match an image. Unlike CLIP, which maps an image and its caption each to a single vector, Llip outputs a set of visual mixture tokens that are combined into a final representation conditioned on the text. This allows Llip to capture the rich and diverse nature of visual inputs, where multiple valid captions can describe the same image from different perspectives.
The key innovation in Llip is a cross-attention mechanism that infers the weights for mixing the visual mixture tokens from the text caption. This lets Llip produce a different visual representation for each textual context, yielding richer and more expressive features. The authors demonstrate that Llip outperforms non-contextualized baselines such as CLIP and SigLIP on a range of zero-shot classification and retrieval tasks: it reaches 83.5% top-1 accuracy on ImageNet, outperforming a similarly sized CLIP by 1.4%, and improves zero-shot retrieval on MS-COCO by 6.0%.
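To make the mechanism concrete, below is a minimal PyTorch sketch of caption-conditioned pooling in the spirit of Llip. This is not the authors' implementation: the module name `ContextualizedVisualPooling`, the single projection layers, the default temperature, and the dimensions are illustrative assumptions; it only shows the idea of a text query attending over visual mixture tokens to produce a contextualized image representation.

```python
# Minimal sketch of Llip-style contextualized pooling (not the authors' code).
# Assumes the vision encoder already produced K "mixture tokens" per image and
# the text encoder produced one caption embedding; all names here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualizedVisualPooling(nn.Module):
    """Mixes K visual mixture tokens into one vector, weighted by a text query."""

    def __init__(self, dim: int, temperature: float = 5.0):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)   # projects the caption embedding to a query
        self.key_proj = nn.Linear(dim, dim)     # projects mixture tokens to keys
        self.temperature = temperature          # softmax temperature for the mixing weights

    def forward(self, mixture_tokens: torch.Tensor, caption_emb: torch.Tensor) -> torch.Tensor:
        # mixture_tokens: (batch, K, dim); caption_emb: (batch, dim)
        q = self.query_proj(caption_emb).unsqueeze(1)          # (batch, 1, dim)
        k = self.key_proj(mixture_tokens)                      # (batch, K, dim)
        logits = (q @ k.transpose(1, 2)) / self.temperature    # (batch, 1, K)
        weights = logits.softmax(dim=-1)                       # mixing weights over the K tokens
        visual_rep = (weights @ mixture_tokens).squeeze(1)     # (batch, dim), caption-conditioned
        return F.normalize(visual_rep, dim=-1)                 # unit-norm, ready for a contrastive loss

# Usage: 8 mixture tokens of width 512, one caption embedding per image.
pool = ContextualizedVisualPooling(dim=512)
tokens = torch.randn(4, 8, 512)
caption = torch.randn(4, 512)
print(pool(tokens, caption).shape)  # torch.Size([4, 512])
```

The pooled vector can then replace CLIP's single image embedding in a standard contrastive objective, which is what makes the representation caption-dependent rather than fixed per image.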
The paper also provides a comprehensive analysis of the components of Llip, including the number of mixture tokens and the temperature of the softmax in the cross-attention module. The results show that increasing the number of mixture tokens and adjusting the softmax temperature can further enhance performance. Overall, Llip demonstrates significant improvements in both the quality of visual representations and downstream task performance, making it a promising approach for vision-language pretraining.
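As a rough illustration of these two knobs, the snippet below (reusing the `ContextualizedVisualPooling` sketch above) varies the number of mixture tokens and the softmax temperature; the values are placeholders, not the grid reported by the authors.

```python
# Toy sweep over the two ablated knobs: the number of mixture tokens K and the
# softmax temperature of the cross-attention. Reuses torch and
# ContextualizedVisualPooling from the sketch above; ranges are illustrative only.
for num_tokens in (8, 32, 64):
    for temperature in (1.0, 5.0, 10.0):
        pool = ContextualizedVisualPooling(dim=512, temperature=temperature)
        tokens = torch.randn(2, num_tokens, 512)
        caption = torch.randn(2, 512)
        # The pooled representation keeps the same width regardless of K;
        # only the mixing weights over the tokens change with K and the temperature.
        assert pool(tokens, caption).shape == (2, 512)
```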