Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision


11 Jun 2021 | Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, Tom Duerig
This paper presents a method for scaling up visual and vision-language representation learning using a large-scale noisy image-alt-text dataset. The dataset contains over one billion image-alt-text pairs and is collected with the same methodology as the Conceptual Captions dataset, but without its expensive filtering or post-processing steps. A simple dual-encoder architecture, named ALIGN, learns visual and language representations by aligning image and text embeddings in a shared latent space through a contrastive loss.
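The training objective is a symmetric contrastive loss over in-batch image-text pairs: each image is pulled toward its own alt-text and pushed away from every other text in the batch, and vice versa. Below is a minimal NumPy sketch of such an objective, under stated assumptions: embeddings arrive as dense (batch, dim) arrays, matched pairs share a row index, and the temperature is fixed here although the paper learns it during training. Names are illustrative, not from the authors' code.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.05):
    """Symmetric normalized-softmax contrastive loss over a batch.

    image_emb, text_emb: (batch, dim) arrays. Matched pairs share a row
    index; all other rows act as in-batch negatives. The paper learns
    the temperature; it is fixed here for simplicity.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.T / temperature  # (batch, batch)

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    # Negative log-likelihood of the matched pair, in both directions:
    # image-to-text (rows) and text-to-image (columns).
    n = logits.shape[0]
    diag = (np.arange(n), np.arange(n))
    i2t = -log_softmax(logits, axis=1)[diag]
    t2i = -log_softmax(logits, axis=0)[diag]
    return (i2t.mean() + t2i.mean()) / 2
```

Because the negatives come for free from the rest of the batch, this objective benefits directly from the very large batch sizes that the paper's scale makes possible.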
Despite the noisy training data, ALIGN achieves state-of-the-art results on image-text retrieval benchmarks such as Flickr30K and MSCOCO, in both zero-shot and fine-tuned settings, and the shared embedding space further enables cross-modal search with combined image-and-text queries. Its visual representations also transfer well to downstream tasks such as ImageNet classification. The paper additionally extends ALIGN to the multilingual setting: trained on a dataset covering 100+ languages, the model performs strongly on a multilingual image-text retrieval benchmark. Overall, the study shows that large-scale noisy alt-text data, used without expensive curation, suffices to learn visual and vision-language representations that are competitive with, and often better than, those of existing methods.
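Given the aligned embedding space, zero-shot classification reduces to nearest-neighbor search between an image embedding and the embeddings of class-name prompts (e.g., encoding strings like "photo of a dog"). A minimal sketch under the same assumptions as above, with hypothetical helper names:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class whose text embedding is closest
    to the image embedding under cosine similarity.

    image_emb: (dim,) embedding of one image.
    class_text_embs: (num_classes, dim) embeddings of class prompts,
    produced by the same text encoder used during training.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    class_text_embs = class_text_embs / np.linalg.norm(
        class_text_embs, axis=1, keepdims=True)
    scores = class_text_embs @ image_emb  # cosine similarities
    return int(np.argmax(scores))
```

No classifier head is trained: swapping in a new label set only requires encoding its class-name prompts, which is what makes the zero-shot evaluation possible.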