7 Oct 2021 | Junnan Li, Ramprasaath R. Selvaraju, Akhilesh D. Gotmare, Shafiq Joty, Caiming Xiong, Steven C.H. Hoi
ALBEF (Align Before Fuse) is a framework for vision-language representation learning that aligns image and text representations before fusing them. It introduces a contrastive loss to align the unimodal image and text representations before fusing them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, ALBEF requires neither bounding box annotations nor high-resolution images. To improve learning from noisy web data, the method proposes momentum distillation, a self-training approach that learns from pseudo-targets produced by a momentum model. A theoretical analysis shows that ALBEF maximizes mutual information between different views of an image-text pair, so the different training tasks can be interpreted as different ways of generating views. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, it outperforms CLIP and ALIGN, which are pre-trained on much larger datasets. On VQA and NLVR², it achieves absolute improvements of 2.37% and 3.84% over the state-of-the-art, while enjoying faster inference speed. Code and models are available at https://github.com/salesforce/ALBEF.
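The two central ideas, an image-text contrastive loss that aligns the unimodal encoders before fusion and momentum distillation from an exponential-moving-average copy of the model, can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the encoder modules, their `out_dim` attribute, and hyperparameters such as the distillation weight `alpha` are placeholders, and the real ALBEF additionally uses feature queues, an image-text matching loss, and masked language modeling (see the linked repository for the official code).

```python
# Minimal sketch of ALBEF-style contrastive alignment with momentum distillation.
# NOT the authors' code: encoders, dimensions, and hyperparameters are illustrative
# placeholders. See https://github.com/salesforce/ALBEF for the official implementation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ALBEFContrastiveSketch(nn.Module):
    def __init__(self, image_encoder, text_encoder, embed_dim=256,
                 temperature=0.07, momentum=0.995, alpha=0.4):
        super().__init__()
        # Unimodal encoders (e.g. a ViT and a BERT) produce features that are
        # projected into a shared space and aligned *before* cross-modal fusion.
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.image_proj = nn.Linear(image_encoder.out_dim, embed_dim)
        self.text_proj = nn.Linear(text_encoder.out_dim, embed_dim)
        self.temp = temperature
        self.momentum = momentum
        self.alpha = alpha  # weight of the soft pseudo-targets
        # Momentum (EMA) copies provide pseudo-targets for noisy web pairs.
        self.image_encoder_m = copy.deepcopy(image_encoder)
        self.text_encoder_m = copy.deepcopy(text_encoder)
        self.image_proj_m = copy.deepcopy(self.image_proj)
        self.text_proj_m = copy.deepcopy(self.text_proj)
        for p in self._momentum_params():
            p.requires_grad_(False)

    def _momentum_params(self):
        for m in (self.image_encoder_m, self.text_encoder_m,
                  self.image_proj_m, self.text_proj_m):
            yield from m.parameters()

    @torch.no_grad()
    def _momentum_update(self):
        online = (list(self.image_encoder.parameters()) +
                  list(self.text_encoder.parameters()) +
                  list(self.image_proj.parameters()) +
                  list(self.text_proj.parameters()))
        for p_m, p in zip(self._momentum_params(), online):
            p_m.data.mul_(self.momentum).add_(p.data, alpha=1.0 - self.momentum)

    def forward(self, images, texts):
        # Online embeddings, L2-normalized for the contrastive similarity.
        img_emb = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        txt_emb = F.normalize(self.text_proj(self.text_encoder(texts)), dim=-1)

        with torch.no_grad():
            self._momentum_update()
            img_emb_m = F.normalize(self.image_proj_m(self.image_encoder_m(images)), dim=-1)
            txt_emb_m = F.normalize(self.text_proj_m(self.text_encoder_m(texts)), dim=-1)
            # Soft pseudo-targets: mix the momentum model's similarity distribution
            # with the one-hot ground truth to tolerate noisy image-text pairs.
            one_hot = torch.eye(images.size(0), device=img_emb.device)
            i2t_targets = (self.alpha * F.softmax(img_emb_m @ txt_emb_m.t() / self.temp, dim=1)
                           + (1 - self.alpha) * one_hot)
            t2i_targets = (self.alpha * F.softmax(txt_emb_m @ img_emb_m.t() / self.temp, dim=1)
                           + (1 - self.alpha) * one_hot)

        # Image-text contrastive loss against the (soft) targets, both directions.
        sim_i2t = img_emb @ txt_emb.t() / self.temp
        sim_t2i = txt_emb @ img_emb.t() / self.temp
        loss_i2t = -(F.log_softmax(sim_i2t, dim=1) * i2t_targets).sum(1).mean()
        loss_t2i = -(F.log_softmax(sim_t2i, dim=1) * t2i_targets).sum(1).mean()
        return (loss_i2t + loss_t2i) / 2
```

The design choice to hedge against noisy web captions is visible in the target construction: with `alpha = 0`, this reduces to a standard in-batch contrastive loss with one-hot targets, while larger values let the momentum model's softened similarity distribution supply credit for captions that describe the image only loosely.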