ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks


6 Aug 2019 | Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee
ViLBERT (Vision-and-Language BERT) is a model designed to learn task-agnostic joint representations of image content and natural language. It extends the BERT architecture to a multi-modal two-stream model, processing visual and textual inputs in separate streams that interact through co-attentional transformer layers. The model is pretrained on the Conceptual Captions dataset using two proxy tasks: predicting masked words and image regions, and predicting whether an image and a text segment correspond. After pretraining, ViLBERT is transferred to multiple vision-and-language tasks, including visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval, achieving state-of-the-art performance on all four. The key innovation is the two-stream architecture, which allows variable depth per modality and sparse interaction between modalities, and outperforms a single-stream unified model. The approach demonstrates that visual grounding can be learned as a pretrained, transferable capability, shifting the focus from task-specific training toward a more general pretrained model for visual grounding.
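To make the co-attention idea concrete, here is a minimal PyTorch sketch of one co-attentional block: each stream uses its own hidden states as queries but takes keys and values from the other stream, so linguistic features condition visual attention and vice versa. This is an illustration built on PyTorch's standard `nn.MultiheadAttention`, not the authors' implementation; the class name, hidden sizes, and head count are assumptions for the example.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Illustrative co-attention block: each stream attends over the other.
    Dimensions and names are assumptions for this sketch, not ViLBERT's exact code."""

    def __init__(self, dim_v=1024, dim_t=768, num_heads=8):
        super().__init__()
        # Cross-attention: visual queries attend to text keys/values, and vice versa.
        self.attn_v = nn.MultiheadAttention(dim_v, num_heads, kdim=dim_t, vdim=dim_t, batch_first=True)
        self.attn_t = nn.MultiheadAttention(dim_t, num_heads, kdim=dim_v, vdim=dim_v, batch_first=True)
        self.norm_v1, self.norm_v2 = nn.LayerNorm(dim_v), nn.LayerNorm(dim_v)
        self.norm_t1, self.norm_t2 = nn.LayerNorm(dim_t), nn.LayerNorm(dim_t)
        self.ffn_v = nn.Sequential(nn.Linear(dim_v, 4 * dim_v), nn.GELU(), nn.Linear(4 * dim_v, dim_v))
        self.ffn_t = nn.Sequential(nn.Linear(dim_t, 4 * dim_t), nn.GELU(), nn.Linear(4 * dim_t, dim_t))

    def forward(self, vis, txt):
        # Swap keys/values across streams: vision queries text, text queries vision.
        vis_att, _ = self.attn_v(query=vis, key=txt, value=txt)
        txt_att, _ = self.attn_t(query=txt, key=vis, value=vis)
        vis = self.norm_v1(vis + vis_att)
        txt = self.norm_t1(txt + txt_att)
        # Per-stream feed-forward with residual connections, as in a standard transformer layer.
        vis = self.norm_v2(vis + self.ffn_v(vis))
        txt = self.norm_t2(txt + self.ffn_t(txt))
        return vis, txt

# Usage: a batch of 36 image-region features and 20 token embeddings (shapes are illustrative).
regions = torch.randn(2, 36, 1024)
tokens = torch.randn(2, 20, 768)
vis_out, txt_out = CoAttentionBlock()(regions, tokens)
```

Because the two streams keep separate parameters and hidden sizes, blocks like this can be stacked to different depths for vision and language, which is the "variable depths and sparse interaction" property the summary refers to.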