Unified Vision-Language Pre-Training for Image Captioning and VQA


4 Dec 2019 | Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao
This paper introduces a unified Vision-Language Pre-training (VLP) model that can be fine-tuned for both vision-language generation (e.g., image captioning) and understanding (e.g., visual question answering) tasks. The model uses a single shared multi-layer transformer network for both encoding and decoding, unlike existing methods in which the encoder and decoder are separate models. VLP is pre-trained on a large corpus of image-caption pairs with two unsupervised objectives: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two objectives differ only in the context the prediction conditions on, which is controlled by task-specific self-attention masks applied to the shared transformer. VLP achieves state-of-the-art results on three challenging benchmarks, COCO Captions, Flickr30k Captions, and VQA 2.0, outperforming existing methods on both image captioning and visual question answering.

The unified encoder-decoder structure enables more effective cross-task knowledge sharing and removes the need to maintain separate pre-trained models for different task types. Compared to random initialization or language-only pre-training, the pre-trained VLP model yields significant improvements in both fine-tuning speed and overall accuracy across the three benchmarks. A sketch of how the two objectives can share one network via different attention masks is given below.
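The following is a minimal, illustrative sketch of how a single transformer can serve both objectives purely through its self-attention mask, in the spirit of the UniLM-style masking the paper builds on. The function name `build_attention_mask`, the 1/0 mask convention (1 = may attend), and the exact mask layout are assumptions for illustration, not the authors' implementation.

```python
import torch

def build_attention_mask(num_img_tokens: int, num_txt_tokens: int, mode: str) -> torch.Tensor:
    """Build a (seq_len x seq_len) self-attention mask for a shared transformer.

    mode == "bidirectional": every token (image region or word) may attend to
    every other token, matching the bidirectional masked-prediction objective.
    mode == "seq2seq": image regions attend only among themselves, while each
    word attends to all image regions plus itself and the words before it
    (causal on the text side), matching the seq2seq objective.

    Entry 1 means "may attend", 0 means "blocked".
    """
    n = num_img_tokens + num_txt_tokens
    if mode == "bidirectional":
        return torch.ones(n, n)

    mask = torch.zeros(n, n)
    # Image regions attend bidirectionally among themselves.
    mask[:num_img_tokens, :num_img_tokens] = 1
    # Each word attends to all image regions ...
    mask[num_img_tokens:, :num_img_tokens] = 1
    # ... and to itself and earlier words (lower-triangular causal block).
    mask[num_img_tokens:, num_img_tokens:] = torch.tril(
        torch.ones(num_txt_tokens, num_txt_tokens)
    )
    return mask

# Example: 3 image-region tokens followed by 4 caption tokens.
print(build_attention_mask(3, 4, "seq2seq"))
```

Because only the mask changes between objectives, the same transformer weights are updated by both pre-training tasks, which is what allows the fine-tuned model to cover generation (caption decoding under the seq2seq mask) and understanding (VQA under the bidirectional mask) without separate pre-trained models.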