9 Aug 2019 | Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh & Kai-Wei Chang
VisualBERT is a simple and flexible framework designed to model a wide range of vision-and-language tasks. It integrates BERT, a Transformer-based model for natural language processing, with pre-trained object proposal systems like Faster-RCNN. VisualBERT processes image features and text inputs through multiple Transformer layers, allowing implicit alignment between elements of the text and regions in the image. The model is pre-trained on image caption data using two visually-grounded language model objectives: masked language modeling and sentence-image prediction. Experiments on four vision-and-language tasks—VQA, VCR, NLVR², and Flickr30K—show that VisualBERT outperforms or rivals state-of-the-art models while being significantly simpler. Analysis reveals that VisualBERT can ground language elements to image regions without explicit supervision and is sensitive to syntactic relationships, tracking associations between verbs and their arguments.
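To make the architecture concrete, here is a minimal sketch (not the authors' released code) of how VisualBERT-style inputs can be assembled: text token embeddings and Faster-RCNN region features, projected to the same hidden size and tagged with segment embeddings, are concatenated and passed through one shared Transformer encoder, with heads for the two pre-training objectives. The hidden size, 2048-dimensional region features, and all class/parameter names are illustrative assumptions.

```python
# Sketch of a VisualBERT-style joint encoder; dimensions and names are assumptions.
import torch
import torch.nn as nn

class VisualBERTSketch(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, region_dim=2048,
                 layers=12, heads=12, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        self.seg_emb = nn.Embedding(2, hidden)             # 0 = text, 1 = image region
        self.region_proj = nn.Linear(region_dim, hidden)   # map detector features to hidden size
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.mlm_head = nn.Linear(hidden, vocab_size)      # masked language modeling
        self.sip_head = nn.Linear(hidden, 2)               # sentence-image prediction

    def forward(self, token_ids, region_feats):
        # token_ids: (B, T) word-piece ids; region_feats: (B, R, region_dim) from Faster-RCNN
        B, T = token_ids.shape
        R = region_feats.shape[1]
        pos = torch.arange(T, device=token_ids.device).unsqueeze(0)
        text = (self.tok_emb(token_ids) + self.pos_emb(pos)
                + self.seg_emb(torch.zeros_like(token_ids)))
        vis = self.region_proj(region_feats) + self.seg_emb(
            torch.ones(B, R, dtype=torch.long, device=token_ids.device))
        # Joint self-attention over the concatenated text and visual tokens lets the
        # model align words with image regions implicitly.
        h = self.encoder(torch.cat([text, vis], dim=1))
        return self.mlm_head(h[:, :T]), self.sip_head(h[:, 0])
```

In pre-training, the masked-LM head would be trained to recover masked caption tokens given the unmasked text and all region features, while the sentence-image head would classify whether the caption matches the paired image; downstream tasks such as VQA would replace these heads with task-specific classifiers and fine-tune the whole encoder.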