LXMERT is a framework for learning cross-modal representations between vision and language using a Transformer-based model. It consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. The model is pre-trained on a large-scale dataset of image-and-sentence pairs with five tasks: masked language modeling, masked object prediction via feature regression, masked object prediction via label classification, cross-modality matching, and image question answering. These tasks encourage the model to learn both intra-modal and cross-modal relationships. After pre-training and task-specific fine-tuning, LXMERT achieves state-of-the-art results on the visual question answering datasets VQA and GQA. It also performs strongly on the challenging visual reasoning task NLVR², improving the previous best result by 22% absolute accuracy. The model's effectiveness is further supported by ablation studies and attention visualizations, and comparisons with BERT and other models show that the cross-modal pre-training strategies contribute substantially to its performance. Overall, the pre-train-then-fine-tune recipe yields strong generalization across diverse vision-and-language reasoning tasks.
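To make the three-encoder layout concrete, below is a minimal PyTorch sketch of the architecture described above: a language encoder and an object relationship (vision) encoder run separately, and a stack of cross-modality layers then exchanges information between the two streams via cross-attention. The class names (`LXMERTSketch`, `CrossModalityLayer`), the use of PyTorch's built-in Transformer layers, and the omission of the pre-training heads are simplifications for illustration, not the authors' exact implementation; the layer counts (9 language, 5 vision, 5 cross-modality) follow the paper.

```python
# Illustrative sketch of LXMERT's three encoders; not the official implementation.
import torch
import torch.nn as nn


class CrossModalityLayer(nn.Module):
    """One cross-modality layer: cross-attention between the two streams,
    followed by per-modality self-attention and feed-forward blocks."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.lang_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.vis_self = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, lang, vis):
        # Each modality queries the other (keys/values come from the other stream).
        lang2, _ = self.lang_cross(lang, vis, vis)
        vis2, _ = self.vis_cross(vis, lang, lang)
        # Intra-modality self-attention + feed-forward on the updated features.
        return self.lang_self(lang + lang2), self.vis_self(vis + vis2)


class LXMERTSketch(nn.Module):
    """Language encoder + object relationship encoder + cross-modality encoder."""

    def __init__(self, vocab_size=30522, obj_feat_dim=2048, dim=768, heads=12,
                 n_lang=9, n_vis=5, n_cross=5):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.obj_proj = nn.Linear(obj_feat_dim, dim)   # project detected RoI features
        self.pos_proj = nn.Linear(4, dim)              # project bounding-box coordinates
        self.lang_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), n_lang)
        self.vis_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), n_vis)
        self.cross = nn.ModuleList(CrossModalityLayer(dim, heads) for _ in range(n_cross))

    def forward(self, token_ids, obj_feats, obj_boxes):
        # Single-modality encoding (token positional embeddings omitted for brevity).
        lang = self.lang_enc(self.word_emb(token_ids))
        vis = self.vis_enc(self.obj_proj(obj_feats) + self.pos_proj(obj_boxes))
        # Cross-modality encoding.
        for layer in self.cross:
            lang, vis = layer(lang, vis)
        return lang, vis  # cross-modal language and vision representations


# Example forward pass with dummy inputs.
model = LXMERTSketch()
tokens = torch.randint(0, 30522, (2, 20))   # batch of 2 sentences, 20 tokens each
feats = torch.randn(2, 36, 2048)            # 36 detected object features per image
boxes = torch.rand(2, 36, 4)                # normalized box coordinates
lang_out, vis_out = model(tokens, feats, boxes)
```

In practice the pre-training heads (masked language modeling, masked object prediction, cross-modality matching, and image question answering) would be attached on top of these outputs; they are left out here to keep the sketch focused on the encoder structure.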