Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

26 Jul 2020 | Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao
This paper proposes OSCAR, a new pre-training method for vision-language tasks. Its key idea is to use object tags detected in images as anchor points that significantly ease the learning of semantic alignments between images and texts. The model is pre-trained on a large-scale vision-language corpus of 6.5 million text-image pairs and then fine-tuned on downstream tasks, setting new state-of-the-art results on six well-established vision-language understanding and generation tasks.

OSCAR represents each input as a triple consisting of a word sequence, a set of object tags, and a set of image region features, and encodes it with a multi-layer Transformer that learns image-text semantic alignments through self-attention. Because there is no explicit alignment information between image regions and text, alignment modeling is inherently a weakly supervised learning problem. OSCAR addresses this by using the object tags as anchor points that tie image regions to the word embeddings of a pre-trained language model. Pre-training uses two losses: a masked token loss over the words and object tags, and a contrastive loss over the object tags, in which the model must distinguish the original tag sequence from a randomly substituted one.
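To make the input layout and the two objectives concrete, below is a minimal PyTorch-style sketch of the (words, tags, regions) triple and the pre-training losses described above. The hidden size, region feature dimension, and module names are illustrative assumptions of this sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OscarPretrainSketch(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, region_dim=2054, num_layers=12):
        super().__init__()
        # Words and object tags share one text embedding table, which is what
        # lets the tags act as anchors into the language model's word space.
        self.token_emb = nn.Embedding(vocab_size, hidden)
        # Region features (e.g. detector features concatenated with box
        # coordinates) are projected into the same hidden space.
        self.region_proj = nn.Linear(region_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.mlm_head = nn.Linear(hidden, vocab_size)   # masked token loss over words + tags
        self.match_head = nn.Linear(hidden, 2)          # original vs. substituted tag sequence

    def forward(self, word_ids, tag_ids, region_feats, mlm_labels, match_labels):
        # Input triple: word sequence, object tags, image region features.
        text = self.token_emb(torch.cat([word_ids, tag_ids], dim=1))
        regions = self.region_proj(region_feats)
        h = self.encoder(torch.cat([text, regions], dim=1))

        n_text = text.size(1)
        # Masked token loss: recover masked words and tags from the context.
        mlm_logits = self.mlm_head(h[:, :n_text])
        mlm_loss = F.cross_entropy(
            mlm_logits.reshape(-1, mlm_logits.size(-1)),
            mlm_labels.reshape(-1), ignore_index=-100)
        # Contrastive loss: classify from the first token whether the tag
        # sequence is the original one or a randomly substituted one.
        match_loss = F.cross_entropy(self.match_head(h[:, 0]), match_labels)
        return mlm_loss + match_loss
```

During pre-training, `match_labels` would be 1 when the tags are those detected in the image and 0 when they have been replaced by randomly sampled tags, following the contrastive objective summarized above.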
The pre-trained model is fine-tuned on seven vision-language tasks: five understanding tasks (image retrieval, text retrieval, VQA, GQA, and NLVR2) and two generation tasks (image captioning and novel object captioning). It achieves new state-of-the-art results on six of these tasks, showing significant improvements over existing methods. In practice, the model can generate detailed descriptions of images, answer natural-language questions about images, and handle tasks that require joint reasoning over visual and linguistic information.
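As one example of the fine-tuning stage, here is a hedged sketch of a VQA head; OSCAR casts VQA as classification over a fixed answer vocabulary (3,129 answers in the standard VQA v2 setup). The `encode` helper named in the usage comment is a hypothetical wrapper around the pre-trained encoder, not part of the released code.

```python
import torch
import torch.nn as nn

class VqaHead(nn.Module):
    """Illustrative answer classifier placed on top of the pre-trained encoder."""
    def __init__(self, hidden=768, num_answers=3129):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_answers))

    def forward(self, pooled):
        # `pooled` is the encoder's first-token representation for the
        # (question, object tags, region features) input, shape (batch, hidden).
        return self.mlp(pooled)  # answer logits, trained with a classification loss

# Hypothetical usage:
#   pooled = encode(question_ids, tag_ids, region_feats)  # pre-trained encoder
#   logits = VqaHead()(pooled)
```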