Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

26 Jul 2020 | Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao
This paper proposes OSCAR, a new pre-training method for vision-language tasks. Its key idea is to use object tags detected in images as anchor points that significantly ease the learning of semantic alignments between images and texts. The model is pre-trained on a large-scale vision-language corpus of 6.5 million text-image pairs and then fine-tuned on downstream tasks, setting new state-of-the-art results on six well-established vision-language understanding and generation tasks.

OSCAR represents each input as a triple consisting of a word sequence, a set of object tags, and a set of image region features, and encodes it with a multi-layer Transformer that learns image-text semantic alignments through self-attention. Because there is no explicit alignment information between image regions and text, alignment modeling is inherently a weakly supervised learning problem. OSCAR addresses this by using the object tags as anchor points that tie image regions to the word embeddings of a pre-trained language model. Pre-training uses two losses: a masked token loss over the words and object tags, and a contrastive loss over the object tags, in which the model must distinguish the original tag sequence from a randomly substituted one.
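To make the input layout and the two objectives concrete, below is a minimal PyTorch-style sketch of the (words, tags, regions) triple and the pre-training losses described above. The hidden size, region feature dimension, and module names are illustrative assumptions of this sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OscarPretrainSketch(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, region_dim=2054, num_layers=12):
        super().__init__()
        # Words and object tags share one text embedding table, which is what
        # lets the tags act as anchors into the language model's word space.
        self.token_emb = nn.Embedding(vocab_size, hidden)
        # Region features (e.g. detector features concatenated with box
        # coordinates) are projected into the same hidden space.
        self.region_proj = nn.Linear(region_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.mlm_head = nn.Linear(hidden, vocab_size)   # masked token loss over words + tags
        self.match_head = nn.Linear(hidden, 2)          # original vs. substituted tag sequence

    def forward(self, word_ids, tag_ids, region_feats, mlm_labels, match_labels):
        # Input triple: word sequence, object tags, image region features.
        text = self.token_emb(torch.cat([word_ids, tag_ids], dim=1))
        regions = self.region_proj(region_feats)
        h = self.encoder(torch.cat([text, regions], dim=1))

        n_text = text.size(1)
        # Masked token loss: recover masked words and tags from the context.
        mlm_logits = self.mlm_head(h[:, :n_text])
        mlm_loss = F.cross_entropy(
            mlm_logits.reshape(-1, mlm_logits.size(-1)),
            mlm_labels.reshape(-1), ignore_index=-100)
        # Contrastive loss: classify from the first token whether the tag
        # sequence is the original one or a randomly substituted one.
        match_loss = F.cross_entropy(self.match_head(h[:, 0]), match_labels)
        return mlm_loss + match_loss
```

During pre-training, `match_labels` would be 1 when the tags are those detected in the image and 0 when they have been replaced by randomly sampled tags, following the contrastive objective summarized above.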
The pre-trained model is fine-tuned on seven vision-language tasks: five understanding tasks (image retrieval, text retrieval, VQA, GQA, and NLVR2) and two generation tasks (image captioning and novel object captioning). It achieves new state-of-the-art results on six of these tasks, showing significant improvements over existing methods. In practice, the model can generate detailed descriptions of images, answer natural-language questions about images, and handle tasks that require joint reasoning over visual and linguistic information.
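As one example of the fine-tuning stage, here is a hedged sketch of a VQA head; OSCAR casts VQA as classification over a fixed answer vocabulary (3,129 answers in the standard VQA v2 setup). The `encode` helper named in the usage comment is a hypothetical wrapper around the pre-trained encoder, not part of the released code.

```python
import torch
import torch.nn as nn

class VqaHead(nn.Module):
    """Illustrative answer classifier placed on top of the pre-trained encoder."""
    def __init__(self, hidden=768, num_answers=3129):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_answers))

    def forward(self, pooled):
        # `pooled` is the encoder's first-token representation for the
        # (question, object tags, region features) input, shape (batch, hidden).
        return self.mlp(pooled)  # answer logits, trained with a classification loss

# Hypothetical usage:
#   pooled = encode(question_ids, tag_ids, region_feats)  # pre-trained encoder
#   logits = VqaHead()(pooled)
```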