CoCa (Contrastive Captioner) is an image-text encoder-decoder foundation model pretrained jointly with a contrastive loss and a captioning loss, subsuming capabilities of contrastive approaches like CLIP and generative methods like SimVLM. Unlike a standard encoder-decoder transformer, CoCa omits cross-attention in the first half of the decoder layers, which encode unimodal text representations, and cascades the remaining decoder layers, which cross-attend to the image encoder, to produce multimodal image-text representations. The contrastive loss is applied between the unimodal image and text embeddings, and the captioning loss to the multimodal decoder's autoregressive outputs.
CoCa is pretrained end-to-end on web-scale alt-text data and annotated images, treating all labels simply as text. Both objectives are computed with minimal overhead, and the pretrained model transfers to downstream tasks zero-shot or with minimal task-specific adaptation. CoCa achieves state-of-the-art performance on a broad range of tasks, including visual recognition, crossmodal retrieval, multimodal understanding, and image captioning; on ImageNet it reaches 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder, and 91.0% with a fine-tuned encoder. It outperforms specialized models across multiple tasks, demonstrating the effectiveness of combining contrastive and generative objectives in a unified model.
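The two training objectives can be sketched in plain NumPy: a symmetric CLIP-style contrastive (InfoNCE) loss over the unimodal image and text embeddings, plus a token-level cross-entropy captioning loss on the multimodal decoder's logits. The temperature and the loss weights `lam_con` / `lam_cap` below are illustrative placeholders, not CoCa's exact training hyperparameters, and the encoder/decoder that would produce these tensors is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired unimodal embeddings."""
    # L2-normalize the unimodal image and text embeddings
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (B, B); matched pairs on the diagonal
    idx = np.arange(len(logits))
    i2t = -np.log(softmax(logits, axis=1)[idx, idx])  # image-to-text direction
    t2i = -np.log(softmax(logits, axis=0)[idx, idx])  # text-to-image direction
    return (i2t.mean() + t2i.mean()) / 2

def captioning_loss(dec_logits, targets):
    """Autoregressive cross-entropy on multimodal decoder outputs.

    dec_logits: (B, T, V) token logits; targets: (B, T) integer token ids.
    """
    probs = softmax(dec_logits, axis=-1)
    B, T = targets.shape
    tok_probs = probs[np.arange(B)[:, None], np.arange(T)[None, :], targets]
    return -np.log(tok_probs).mean()

def coca_loss(img_emb, txt_emb, dec_logits, targets, lam_con=1.0, lam_cap=2.0):
    # Single pretraining objective: weighted sum of the two losses
    return (lam_con * contrastive_loss(img_emb, txt_emb)
            + lam_cap * captioning_loss(dec_logits, targets))
```

Because the unimodal text embeddings and the multimodal decoder outputs come from one cascaded decoder, both losses are obtained in a single forward pass, which is what keeps the combined objective cheap relative to training two separate models.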