7 Oct 2021 | Junnan Li, Ramprasaath R. Selvaraju, Akhilesh D. Gotmare, Shafiq Joty, Caiming Xiong, Steven C.H. Hoi
ALBEF (Align Before Fuse) is a framework for vision-language representation learning that aligns image and text representations before fusing them. It introduces a contrastive loss to align the unimodal image and text representations before fusing them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, ALBEF requires neither bounding box annotations nor high-resolution images. To improve learning from noisy web data, the method proposes momentum distillation, a self-training approach that learns from pseudo-targets produced by a momentum model. A theoretical analysis shows that ALBEF maximizes mutual information between different views of an image-text pair, so the different training tasks can be interpreted as different ways of generating views. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, it outperforms CLIP and ALIGN, which are pre-trained on much larger datasets. On VQA and NLVR², it achieves absolute improvements of 2.37% and 3.84% over the state-of-the-art, while enjoying faster inference speed. Code and models are available at https://github.com/salesforce/ALBEF.
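The two central ideas, an image-text contrastive loss that aligns the unimodal encoders before fusion and momentum distillation from an exponential-moving-average copy of the model, can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the encoder modules, their `out_dim` attribute, and hyperparameters such as the distillation weight `alpha` are placeholders, and the real ALBEF additionally uses feature queues, an image-text matching loss, and masked language modeling (see the linked repository for the official code).

```python
# Minimal sketch of ALBEF-style contrastive alignment with momentum distillation.
# NOT the authors' code: encoders, dimensions, and hyperparameters are illustrative
# placeholders. See https://github.com/salesforce/ALBEF for the official implementation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ALBEFContrastiveSketch(nn.Module):
    def __init__(self, image_encoder, text_encoder, embed_dim=256,
                 temperature=0.07, momentum=0.995, alpha=0.4):
        super().__init__()
        # Unimodal encoders (e.g. a ViT and a BERT) produce features that are
        # projected into a shared space and aligned *before* cross-modal fusion.
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.image_proj = nn.Linear(image_encoder.out_dim, embed_dim)
        self.text_proj = nn.Linear(text_encoder.out_dim, embed_dim)
        self.temp = temperature
        self.momentum = momentum
        self.alpha = alpha  # weight of the soft pseudo-targets
        # Momentum (EMA) copies provide pseudo-targets for noisy web pairs.
        self.image_encoder_m = copy.deepcopy(image_encoder)
        self.text_encoder_m = copy.deepcopy(text_encoder)
        self.image_proj_m = copy.deepcopy(self.image_proj)
        self.text_proj_m = copy.deepcopy(self.text_proj)
        for p in self._momentum_params():
            p.requires_grad_(False)

    def _momentum_params(self):
        for m in (self.image_encoder_m, self.text_encoder_m,
                  self.image_proj_m, self.text_proj_m):
            yield from m.parameters()

    @torch.no_grad()
    def _momentum_update(self):
        online = (list(self.image_encoder.parameters()) +
                  list(self.text_encoder.parameters()) +
                  list(self.image_proj.parameters()) +
                  list(self.text_proj.parameters()))
        for p_m, p in zip(self._momentum_params(), online):
            p_m.data.mul_(self.momentum).add_(p.data, alpha=1.0 - self.momentum)

    def forward(self, images, texts):
        # Online embeddings, L2-normalized for the contrastive similarity.
        img_emb = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        txt_emb = F.normalize(self.text_proj(self.text_encoder(texts)), dim=-1)

        with torch.no_grad():
            self._momentum_update()
            img_emb_m = F.normalize(self.image_proj_m(self.image_encoder_m(images)), dim=-1)
            txt_emb_m = F.normalize(self.text_proj_m(self.text_encoder_m(texts)), dim=-1)
            # Soft pseudo-targets: mix the momentum model's similarity distribution
            # with the one-hot ground truth to tolerate noisy image-text pairs.
            one_hot = torch.eye(images.size(0), device=img_emb.device)
            i2t_targets = (self.alpha * F.softmax(img_emb_m @ txt_emb_m.t() / self.temp, dim=1)
                           + (1 - self.alpha) * one_hot)
            t2i_targets = (self.alpha * F.softmax(txt_emb_m @ img_emb_m.t() / self.temp, dim=1)
                           + (1 - self.alpha) * one_hot)

        # Image-text contrastive loss against the (soft) targets, both directions.
        sim_i2t = img_emb @ txt_emb.t() / self.temp
        sim_t2i = txt_emb @ img_emb.t() / self.temp
        loss_i2t = -(F.log_softmax(sim_i2t, dim=1) * i2t_targets).sum(1).mean()
        loss_t2i = -(F.log_softmax(sim_t2i, dim=1) * t2i_targets).sum(1).mean()
        return (loss_i2t + loss_t2i) / 2
```

The design choice to hedge against noisy web captions is visible in the target construction: with `alpha = 0`, this reduces to a standard in-batch contrastive loss with one-hot targets, while larger values let the momentum model's softened similarity distribution supply credit for captions that describe the image only loosely.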