11 Jun 2021 | Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig
This paper addresses the challenge of scaling visual and vision-language representation learning by leveraging a large, noisy dataset of over one billion image alt-text pairs. The dataset, obtained without expensive filtering or post-processing, is used to train a dual-encoder architecture that aligns visual and language representations with a contrastive loss. The resulting model, named ALIGN, achieves state-of-the-art performance on a range of tasks, including zero-shot image classification, cross-modal retrieval, and visual classification on benchmarks such as ImageNet and VTAB. ALIGN outperforms previous methods, even those with more complex cross-attention models, and demonstrates strong generalization to novel concepts. The paper also includes an analysis of the learned embeddings and a multilingual extension of the model, showing its effectiveness in a broader range of applications.
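To make the dual-encoder training objective concrete, below is a minimal sketch of the symmetric in-batch contrastive (InfoNCE-style) loss that aligns image and text embeddings. It assumes a PyTorch-style setup; the function name, batch layout, and the fixed default temperature are illustrative choices, not taken from the paper (ALIGN in fact learns the temperature jointly with the encoders).

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.05) -> torch.Tensor:
    """Symmetric image-to-text / text-to-image contrastive loss.

    image_emb, text_emb: (batch, dim) outputs of the image and text encoders.
    Matching pairs share the same row index; every other row in the batch
    acts as an in-batch negative.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # The positive pair for row i sits in column i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

The same shared embedding space then supports the zero-shot uses described above: classification can be cast as retrieving the nearest class-name (or prompt) embedding for an image, and cross-modal retrieval as ranking candidates by cosine similarity.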