**Abstract:**
This paper introduces Contrastive Captioner (CoCa), a minimalist design for pretraining an image-text encoder-decoder foundation model using contrastive and captioning losses. CoCa combines the strengths of contrastive approaches like CLIP and generative methods like SimVLM, achieving state-of-the-art performance on various downstream tasks with minimal task-specific adaptation.
**Key Contributions:**
1. **Design:** CoCa decouples the decoder into unimodal and multimodal components: the first half of the decoder layers omits cross-attention and encodes unimodal text representations, while the remaining cascaded layers cross-attend to the image encoder to produce multimodal image-text representations.
2. **Losses:** It applies a contrastive loss between the unimodal image and text embeddings and a captioning loss on the multimodal decoder outputs; both objectives are computed together with minimal overhead (see the sketch after this list).
3. **Training:** CoCa is pretrained end-to-end and from scratch on web-scale alt-text data and annotated images, treating all labels simply as text, thereby unifying natural language supervision for representation learning.
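To make the decoupled-decoder design and the two objectives concrete, here is a minimal PyTorch-style sketch. It is an illustrative assumption rather than the authors' implementation: the image encoder is reduced to a linear projection over pre-extracted patch features, the paper's attentional pooling is replaced by mean pooling, and the vocabulary size, layer counts, and loss weighting are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoCaSketch(nn.Module):
    """Illustrative combination of CoCa's contrastive + captioning losses."""

    def __init__(self, vocab_size=32_000, dim=512, n_uni=6, n_multi=6, nhead=8):
        super().__init__()
        # Stand-in for the image encoder: a projection of pre-extracted
        # patch features (the paper trains a ViT end-to-end instead).
        self.img_proj = nn.Linear(768, dim)
        self.tok_emb = nn.Embedding(vocab_size, dim)
        # Learnable [CLS] token appended to the text; its unimodal output
        # serves as the text embedding for the contrastive loss.
        self.cls_emb = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        # Unimodal half: causal self-attention only, no image cross-attention.
        uni_layer = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.uni_decoder = nn.TransformerEncoder(uni_layer, num_layers=n_uni)
        # Multimodal half: cascaded layers that cross-attend to image tokens.
        multi_layer = nn.TransformerDecoderLayer(dim, nhead, batch_first=True)
        self.multi_decoder = nn.TransformerDecoder(multi_layer, num_layers=n_multi)
        self.to_logits = nn.Linear(dim, vocab_size)
        self.log_temp = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07)

    def forward(self, patch_feats, text_ids):
        B, T = text_ids.shape
        img_tokens = self.img_proj(patch_feats)               # (B, P, dim)
        img_emb = img_tokens.mean(dim=1)                       # mean pool (paper: attentional pooling)

        x = torch.cat([self.tok_emb(text_ids),
                       self.cls_emb.expand(B, -1, -1)], dim=1)  # (B, T+1, dim)
        causal = torch.triu(torch.full((T + 1, T + 1), float("-inf"),
                                       device=x.device), diagonal=1)
        uni = self.uni_decoder(x, mask=causal)
        txt_emb, uni_tokens = uni[:, -1], uni[:, :-1]          # [CLS] output, token states

        # Contrastive loss between unimodal image and text embeddings.
        img_z = F.normalize(img_emb, dim=-1)
        txt_z = F.normalize(txt_emb, dim=-1)
        sim = img_z @ txt_z.t() * self.log_temp.exp()           # (B, B)
        labels = torch.arange(B, device=sim.device)
        con_loss = (F.cross_entropy(sim, labels) +
                    F.cross_entropy(sim.t(), labels)) / 2

        # Captioning loss on the multimodal decoder outputs.
        multi = self.multi_decoder(uni_tokens, img_tokens, tgt_mask=causal[:T, :T])
        logits = self.to_logits(multi)                           # (B, T, vocab)
        cap_loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                                   text_ids[:, 1:].reshape(-1))
        # The paper uses a weighted sum of the two terms; weights omitted here.
        return con_loss + cap_loss
```

Splitting the text stack this way lets a single forward pass produce both the unimodal text embedding needed for the contrastive loss and the multimodal states needed for captioning, which is why the objectives add little overhead relative to a standard captioner.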
**Empirical Results:**
- **Zero-shot Transfer:** CoCa achieves 86.3% zero-shot top-1 accuracy on ImageNet.
- **Task-specific Adaptation:** With a frozen encoder and learned classification head, it achieves 90.6% accuracy on ImageNet.
- **State-of-the-art:** It achieves 91.0% top-1 accuracy on ImageNet with a finetuned encoder, outperforming other models on visual recognition, crossmodal retrieval, multimodal understanding, and image captioning tasks.
**Conclusion:**
CoCa unifies single-encoder, dual-encoder, and encoder-decoder paradigms, providing a versatile foundation model capable of handling a wide range of vision and vision-language tasks with minimal training and adaptation.