**Abstract:**
This paper introduces Contrastive Captioner (CoCa), a minimalist design for pretraining an image-text encoder-decoder foundation model using contrastive and captioning losses. CoCa combines the strengths of contrastive approaches like CLIP and generative methods like SimVLM, achieving state-of-the-art performance on various downstream tasks with minimal task-specific adaptation.
**Key Contributions:**
1. **Design:** CoCa decouples the decoder into unimodal and multimodal components: the first half of the decoder layers omits cross-attention and encodes unimodal text representations, while the remaining cascaded layers cross-attend to the image encoder to produce multimodal image-text representations.
2. **Losses:** It applies a contrastive loss between the unimodal image and text embeddings and a captioning loss on the multimodal decoder outputs; both objectives are computed together with minimal overhead (see the sketch after this list).
3. **Training:** CoCa is pretrained end-to-end and from scratch on web-scale alt-text data and annotated images, treating all labels simply as text, thereby unifying natural language supervision for representation learning.
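To make the decoupled-decoder design and the two objectives concrete, here is a minimal PyTorch-style sketch. It is an illustrative assumption rather than the authors' implementation: the image encoder is reduced to a linear projection over pre-extracted patch features, the paper's attentional pooling is replaced by mean pooling, and the vocabulary size, layer counts, and loss weighting are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoCaSketch(nn.Module):
    """Illustrative combination of CoCa's contrastive + captioning losses."""

    def __init__(self, vocab_size=32_000, dim=512, n_uni=6, n_multi=6, nhead=8):
        super().__init__()
        # Stand-in for the image encoder: a projection of pre-extracted
        # patch features (the paper trains a ViT end-to-end instead).
        self.img_proj = nn.Linear(768, dim)
        self.tok_emb = nn.Embedding(vocab_size, dim)
        # Learnable [CLS] token appended to the text; its unimodal output
        # serves as the text embedding for the contrastive loss.
        self.cls_emb = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        # Unimodal half: causal self-attention only, no image cross-attention.
        uni_layer = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.uni_decoder = nn.TransformerEncoder(uni_layer, num_layers=n_uni)
        # Multimodal half: cascaded layers that cross-attend to image tokens.
        multi_layer = nn.TransformerDecoderLayer(dim, nhead, batch_first=True)
        self.multi_decoder = nn.TransformerDecoder(multi_layer, num_layers=n_multi)
        self.to_logits = nn.Linear(dim, vocab_size)
        self.log_temp = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07)

    def forward(self, patch_feats, text_ids):
        B, T = text_ids.shape
        img_tokens = self.img_proj(patch_feats)               # (B, P, dim)
        img_emb = img_tokens.mean(dim=1)                       # mean pool (paper: attentional pooling)

        x = torch.cat([self.tok_emb(text_ids),
                       self.cls_emb.expand(B, -1, -1)], dim=1)  # (B, T+1, dim)
        causal = torch.triu(torch.full((T + 1, T + 1), float("-inf"),
                                       device=x.device), diagonal=1)
        uni = self.uni_decoder(x, mask=causal)
        txt_emb, uni_tokens = uni[:, -1], uni[:, :-1]          # [CLS] output, token states

        # Contrastive loss between unimodal image and text embeddings.
        img_z = F.normalize(img_emb, dim=-1)
        txt_z = F.normalize(txt_emb, dim=-1)
        sim = img_z @ txt_z.t() * self.log_temp.exp()           # (B, B)
        labels = torch.arange(B, device=sim.device)
        con_loss = (F.cross_entropy(sim, labels) +
                    F.cross_entropy(sim.t(), labels)) / 2

        # Captioning loss on the multimodal decoder outputs.
        multi = self.multi_decoder(uni_tokens, img_tokens, tgt_mask=causal[:T, :T])
        logits = self.to_logits(multi)                           # (B, T, vocab)
        cap_loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                                   text_ids[:, 1:].reshape(-1))
        # The paper uses a weighted sum of the two terms; weights omitted here.
        return con_loss + cap_loss
```

Splitting the text stack this way lets a single forward pass produce both the unimodal text embedding needed for the contrastive loss and the multimodal states needed for captioning, which is why the objectives add little overhead relative to a standard captioner.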
**Empirical Results:**
- **Zero-shot Transfer:** CoCa achieves 86.3% zero-shot top-1 accuracy on ImageNet.
- **Task-specific Adaptation:** With a frozen encoder and learned classification head, it achieves 90.6% accuracy on ImageNet.
- **State-of-the-art:** It achieves 91.0% top-1 accuracy on ImageNet with a finetuned encoder, outperforming other models on visual recognition, crossmodal retrieval, multimodal understanding, and image captioning tasks.
**Conclusion:**
CoCa unifies single-encoder, dual-encoder, and encoder-decoder paradigms, providing a versatile foundation model capable of handling a wide range of vision and vision-language tasks with minimal training and adaptation.