Unified Vision-Language Pre-Training for Image Captioning and VQA


4 Dec 2019 | Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao
This paper introduces a unified Vision-Language Pre-training (VLP) model that can be fine-tuned for both vision-language generation (e.g., image captioning) and understanding (e.g., visual question answering) tasks. The model uses a single shared multi-layer transformer network for both encoding and decoding, unlike existing methods in which the encoder and decoder are separate models. VLP is pre-trained on a large corpus of image-caption pairs with two unsupervised objectives: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two objectives differ only in the context the prediction conditions on, which is controlled by task-specific self-attention masks applied to the shared transformer. VLP achieves state-of-the-art results on three challenging benchmarks, COCO Captions, Flickr30k Captions, and VQA 2.0, outperforming existing methods on both image captioning and visual question answering.

The unified encoder-decoder structure enables more effective cross-task knowledge sharing and removes the need to maintain separate pre-trained models for different task types. Compared to random initialization or language-only pre-training, the pre-trained VLP model yields significant improvements in both fine-tuning speed and overall accuracy across the three benchmarks. A sketch of how the two objectives can share one network via different attention masks is given below.
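The following is a minimal, illustrative sketch of how a single transformer can serve both objectives purely through its self-attention mask, in the spirit of the UniLM-style masking the paper builds on. The function name `build_attention_mask`, the 1/0 mask convention (1 = may attend), and the exact mask layout are assumptions for illustration, not the authors' implementation.

```python
import torch

def build_attention_mask(num_img_tokens: int, num_txt_tokens: int, mode: str) -> torch.Tensor:
    """Build a (seq_len x seq_len) self-attention mask for a shared transformer.

    mode == "bidirectional": every token (image region or word) may attend to
    every other token, matching the bidirectional masked-prediction objective.
    mode == "seq2seq": image regions attend only among themselves, while each
    word attends to all image regions plus itself and the words before it
    (causal on the text side), matching the seq2seq objective.

    Entry 1 means "may attend", 0 means "blocked".
    """
    n = num_img_tokens + num_txt_tokens
    if mode == "bidirectional":
        return torch.ones(n, n)

    mask = torch.zeros(n, n)
    # Image regions attend bidirectionally among themselves.
    mask[:num_img_tokens, :num_img_tokens] = 1
    # Each word attends to all image regions ...
    mask[num_img_tokens:, :num_img_tokens] = 1
    # ... and to itself and earlier words (lower-triangular causal block).
    mask[num_img_tokens:, num_img_tokens:] = torch.tril(
        torch.ones(num_txt_tokens, num_txt_tokens)
    )
    return mask

# Example: 3 image-region tokens followed by 4 caption tokens.
print(build_attention_mask(3, 4, "seq2seq"))
```

Because only the mask changes between objectives, the same transformer weights are updated by both pre-training tasks, which is what allows the fine-tuned model to cover generation (caption decoding under the seq2seq mask) and understanding (VQA under the bidirectional mask) without separate pre-trained models.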