Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training


2 Dec 2019 | Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, Ming Zhou
**Authors:** Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, Ming Zhou

**Affiliations:** School of Software & Microelectronics, Peking University, Beijing, China; Natural Language Computing, Microsoft Research Asia, Beijing, China; STCA NLP Group, Microsoft, Beijing, China

**Abstract:** Unicoder-VL is a universal encoder that learns joint representations of vision and language through cross-modal pre-training. Inspired by cross-lingual pre-trained models such as XLM and Unicoder, it uses a multi-layer Transformer trained with three tasks: Masked Language Modeling (MLM), Masked Object Classification (MOC), and Visual-linguistic Matching (VLM). These tasks teach the model context-aware representations of input tokens conditioned on both visual and linguistic content. After pre-training on large-scale image-caption pairs, Unicoder-VL is fine-tuned for caption-based image-text retrieval and visual commonsense reasoning, achieving state-of-the-art or comparable results.

**Introduction:** Pre-trained models have driven major advances in both computer vision (CV) and natural language processing (NLP), but existing models struggle with cross-modal tasks that involve long natural language inputs. Unicoder-VL addresses this by learning joint representations of vision and language with a multi-layer Transformer. Pre-training on image-caption pairs makes it suitable for tasks such as image-text retrieval and visual commonsense reasoning.

**Related Work:** The paper reviews existing pre-trained models in CV and NLP, highlighting their limitations on cross-modal tasks with long natural language inputs, and discusses recent cross-modal pre-training methods, comparing them with Unicoder-VL in terms of architecture, pre-training tasks, and performance.

**Approach:** Unicoder-VL encodes visual and linguistic inputs with a single multi-layer Transformer, learning joint representations through cross-modal pre-training. The model is pre-trained on large-scale image-caption pairs and fine-tuned for specific downstream tasks; a minimal sketch of this joint encoding appears below.
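To make the joint encoding concrete, here is a minimal sketch in PyTorch, assuming BERT-base-style dimensions, detector (e.g., Faster R-CNN-like) region features, and a linear projection of box geometry as the regions' "position" signal. The module names and hyperparameters are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class JointEncoder(nn.Module):
    """Encodes a caption and detected image regions as one Transformer sequence."""
    def __init__(self, vocab_size=30522, hidden=768, layers=12, heads=12, region_dim=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)   # word-piece embeddings
        self.region_proj = nn.Linear(region_dim, hidden)    # project detector features into the text embedding space
        self.box_proj = nn.Linear(5, hidden)                 # encode box geometry (x1, y1, x2, y2, area)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           dim_feedforward=4 * hidden,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, token_ids, region_feats, region_boxes):
        # token_ids:    (B, T)              caption word-piece ids
        # region_feats: (B, R, region_dim)  features of detected objects
        # region_boxes: (B, R, 5)           normalized box coordinates plus area
        text = self.token_emb(token_ids)
        regions = self.region_proj(region_feats) + self.box_proj(region_boxes)
        joint = torch.cat([text, regions], dim=1)  # one sequence over both modalities
        return self.encoder(joint)                 # context-aware states for every token and region
```

During pre-training, MLM and MOC would mask a fraction of the text tokens and region features in this sequence and predict them from the encoder outputs, while VLM classifies whether the caption and image actually belong together.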
**Experiments:** Unicoder-VL is evaluated on image-text retrieval and visual commonsense reasoning, showing significant improvements over state-of-the-art methods. Zero-shot experiments further show that the model learns general cross-modal knowledge without task-specific fine-tuning.

**Results and Analysis:** Unicoder-VL outperforms existing methods on image-text retrieval, achieving state-of-the-art results on datasets such as MSCOCO and Flickr30K. It also performs well on visual commonsense reasoning, indicating the effectiveness of cross-modal pre-training.

**Conclusion:** Unicoder-VL is a powerful universal encoder for cross-modal tasks, leveraging large-scale image-caption pairs for pre-training. Its ability to learn joint representations of vision and language makes it effective across a range of vision-language tasks.
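As an illustration of the caption-based retrieval fine-tuning discussed above, the following sketch scores (image, caption) pairs with a matching head on top of the joint encoder. The pooling choice, head, and ranking loop are assumptions made for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    """Scores whether an (image, caption) pair matches, as in the VLM task."""
    def __init__(self, hidden=768):
        super().__init__()
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, joint_states):
        # joint_states: (B, T+R, hidden) outputs of the cross-modal encoder;
        # the first position is pooled as a [CLS]-style summary (an assumption here).
        return self.classifier(joint_states[:, 0]).squeeze(-1)

def rank_images(encoder, head, caption_ids, image_feats, image_boxes):
    """Sort candidate images by their matching score for a single caption."""
    n = image_feats.size(0)
    captions = caption_ids.unsqueeze(0).expand(n, -1)  # pair the caption with every candidate image
    with torch.no_grad():
        scores = head(encoder(captions, image_feats, image_boxes))
    return torch.argsort(scores, descending=True)       # best-matching images first
```

Text-to-image retrieval would rank images for each caption as above; image-to-text retrieval swaps the roles, scoring every candidate caption against one image.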