CoCa: Contrastive Captioners are Image-Text Foundation Models


14 Jun 2022 | Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu
CoCa (Contrastive Captioner) is an image-text encoder-decoder foundation model pretrained jointly with a contrastive loss and a captioning loss, subsuming the capabilities of contrastive approaches like CLIP and generative methods like SimVLM. Unlike a standard encoder-decoder transformer, CoCa omits cross-attention in the first half of the decoder layers, which therefore encode unimodal text representations, and cascades the remaining decoder layers, which cross-attend to the image encoder to produce multimodal image-text representations. A contrastive loss is applied between the unimodal image and text embeddings, and a captioning loss is applied to the multimodal decoder's autoregressive text predictions.

CoCa is pretrained end-to-end and from scratch on web-scale alt-text data and annotated images, treating all labels simply as text. With zero-shot transfer or minimal task-specific adaptation, it achieves state-of-the-art performance on a broad range of downstream tasks, including visual recognition, crossmodal retrieval, multimodal understanding, and image captioning. On ImageNet classification, CoCa reaches 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder, and 91.0% with a fine-tuned encoder. The design adds minimal computational overhead over a single encoder-decoder model, and CoCa outperforms specialized models on multiple tasks, demonstrating the effectiveness of combining contrastive and generative objectives in a unified model.
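The two objectives are simply summed with scalar weights, i.e. L = λ_Con · L_Con + λ_Cap · L_Cap. Below is a minimal PyTorch-style sketch of that combined loss, assuming pooled unimodal embeddings from the image encoder and unimodal text decoder, plus next-token logits from the multimodal decoder; the function name, tensor shapes, temperature, and loss weights are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of CoCa's combined contrastive + captioning objective.
import torch
import torch.nn.functional as F

def coca_loss(image_emb, text_emb, caption_logits, caption_targets,
              temperature=0.07, lambda_con=1.0, lambda_cap=2.0):
    """Combine the contrastive and captioning losses.

    image_emb:       (B, D) pooled embeddings from the image encoder
    text_emb:        (B, D) pooled embeddings from the unimodal text decoder
    caption_logits:  (B, T, V) next-token logits from the multimodal text decoder
    caption_targets: (B, T) ground-truth caption token ids
    """
    # Contrastive loss: each image should match its paired text, and vice versa.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature           # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_con = 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

    # Captioning loss: autoregressive next-token prediction on multimodal outputs.
    loss_cap = F.cross_entropy(caption_logits.reshape(-1, caption_logits.size(-1)),
                               caption_targets.reshape(-1))

    return lambda_con * loss_con + lambda_cap * loss_cap
```

Because the unimodal text embeddings needed for the contrastive term come from the first (cross-attention-free) half of the decoder, both losses can be computed in a single forward pass, which is what keeps the training overhead minimal.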