ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

6 Aug 2019 | Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee
ViLBERT is a model designed to learn task-agnostic joint representations of image content and natural language. It extends the BERT architecture to a multi-modal two-stream model: visual and textual inputs are processed in separate streams that interact through co-attentional transformer layers. This two-stream design lets each modality have its own network depth and restricts cross-modal interaction to the sparse co-attention layers.

The model is pretrained on the Conceptual Captions dataset using two proxy tasks: masked multi-modal modelling (reconstructing masked words and masked image regions) and multi-modal alignment prediction (deciding whether an image and a caption correspond). The pretrained model is then transferred to four established vision-and-language tasks, namely visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval, with only minor task-specific additions to the base architecture.

ViLBERT achieves state-of-the-art results on these tasks, showing significant improvements over task-specific models and demonstrating that pretraining on captioned images yields effective visual grounding that transfers across tasks. The paper describes the architecture and training process in detail, including the co-attentional transformer layers and the impact of different training settings on performance, and highlights ViLBERT's potential as a versatile base model for vision-and-language tasks.
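To make the co-attention idea concrete, here is a minimal PyTorch sketch of one co-attentional transformer block: each stream builds queries from its own features but attends over the other stream's keys and values, followed by a per-stream feed-forward sublayer. This is an illustration, not the authors' released implementation; the hidden sizes (1024 for image regions, 768 for tokens), head count, and class/variable names are assumptions for the example.

```python
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """Sketch of a co-attentional transformer block (illustrative dimensions)."""

    def __init__(self, d_visual=1024, d_text=768, n_heads=8):
        super().__init__()
        # Cross-attention: visual queries over textual keys/values, and the reverse.
        self.vis_attends_txt = nn.MultiheadAttention(
            d_visual, n_heads, kdim=d_text, vdim=d_text, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(
            d_text, n_heads, kdim=d_visual, vdim=d_visual, batch_first=True)
        self.norm_v1, self.norm_t1 = nn.LayerNorm(d_visual), nn.LayerNorm(d_text)
        # Position-wise feed-forward sublayers, one per stream.
        self.ffn_v = nn.Sequential(nn.Linear(d_visual, 4 * d_visual), nn.GELU(),
                                   nn.Linear(4 * d_visual, d_visual))
        self.ffn_t = nn.Sequential(nn.Linear(d_text, 4 * d_text), nn.GELU(),
                                   nn.Linear(4 * d_text, d_text))
        self.norm_v2, self.norm_t2 = nn.LayerNorm(d_visual), nn.LayerNorm(d_text)

    def forward(self, visual, text):
        # visual: (batch, num_regions, d_visual); text: (batch, num_tokens, d_text)
        v_ctx, _ = self.vis_attends_txt(visual, text, text)    # vision attends to language
        t_ctx, _ = self.txt_attends_vis(text, visual, visual)  # language attends to vision
        visual = self.norm_v1(visual + v_ctx)
        text = self.norm_t1(text + t_ctx)
        visual = self.norm_v2(visual + self.ffn_v(visual))
        text = self.norm_t2(text + self.ffn_t(text))
        return visual, text

# Example: 36 image-region features and 20 token embeddings for a batch of 2.
regions, tokens = torch.randn(2, 36, 1024), torch.randn(2, 20, 768)
regions, tokens = CoAttentionLayer()(regions, tokens)
```

Stacking several such blocks, interleaved with ordinary single-stream transformer layers, gives the two-stream interaction pattern described above while keeping the per-modality depths independent.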