VL-BERT: PRE-TRAINING OF GENERIC VISUAL-LINGUISTIC REPRESENTATIONS

18 Feb 2020 | Weijie Su*, Xizhou Zhu*, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai
VL-BERT is a pre-trained, generic visual-linguistic representation model designed to serve as a backbone for a range of visual-linguistic tasks. It adopts the Transformer architecture and takes both visual and linguistic features as input: each input element is either a word from the sentence or a region of interest (RoI) from the image. The multi-modal Transformer aggregates information across all elements, with special elements used to distinguish the different input formats.

VL-BERT is pre-trained on the Conceptual Captions dataset together with text-only corpora, which helps it better align visual and linguistic clues. Pre-training uses two tasks: masked language modeling with visual clues and masked RoI classification with linguistic clues.

The pre-trained model is then fine-tuned on downstream tasks such as visual commonsense reasoning (VCR), visual question answering (VQA), and referring expression comprehension, where it outperforms prior models and achieves top performance on the VCR benchmark. Extensive experiments validate that the pre-training better aligns the visual and linguistic representations and benefits the downstream tasks.
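Both pre-training tasks operate on a single unified sequence of word and RoI elements. The Python sketch below illustrates one way such a sequence could be assembled and masked; it is a minimal, assumption-laden sketch, not the authors' implementation. The function names (build_inputs, mask_for_pretraining), the toy random embeddings, and the fake detector features are all hypothetical stand-ins for the learned embeddings and Fast(er) R-CNN features used in the paper.

```python
# Illustrative sketch only (hypothetical names, toy random embeddings); VL-BERT
# itself uses learned token/visual/segment/position embeddings and a detector.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8        # toy embedding width (the actual model uses 768 or 1024)
MASK_PROB = 0.15   # fraction of words / RoIs masked during pre-training

def build_inputs(words, roi_features):
    """Assemble one unified sequence: [CLS] + words + [SEP] + RoIs + [END].
    Each element's embedding sums token, visual, segment, and position parts."""
    n_words, n_rois = len(words), len(roi_features)
    length = 1 + n_words + 1 + n_rois + 1          # CLS, words, SEP, RoIs, END
    token_emb = rng.normal(size=(length, EMB_DIM))       # stand-in for a lookup table
    visual_emb = np.zeros((length, EMB_DIM))
    whole_image = roi_features.mean(axis=0)              # stand-in whole-image feature
    visual_emb[: 2 + n_words] = whole_image              # text slots get the full image
    visual_emb[2 + n_words : 2 + n_words + n_rois] = roi_features
    segment_emb = np.where(np.arange(length)[:, None] < 2 + n_words, 0.1, 0.2)
    position_emb = (np.arange(length) / length)[:, None]
    return token_emb + visual_emb + segment_emb + position_emb

def mask_for_pretraining(words, roi_features):
    """Task 1: masked language modeling with visual clues (replace words with [MASK]).
    Task 2: masked RoI classification with linguistic clues (zero out RoI features)."""
    word_mask = rng.random(len(words)) < MASK_PROB
    roi_mask = rng.random(len(roi_features)) < MASK_PROB
    masked_words = ["[MASK]" if m else w for w, m in zip(words, word_mask)]
    masked_rois = np.where(roi_mask[:, None], 0.0, roi_features)
    return masked_words, masked_rois, word_mask, roi_mask

if __name__ == "__main__":
    words = ["a", "kitten", "drinks", "from", "a", "bottle"]
    rois = rng.normal(size=(3, EMB_DIM))                  # fake detector RoI features
    print("input sequence shape:", build_inputs(words, rois).shape)
    masked_words, _, _, _ = mask_for_pretraining(words, rois)
    print("masked words:", masked_words)
```

During actual pre-training, the model predicts the identity of each masked word (using the surrounding text and the RoIs as clues) and the object category of each masked RoI (using the text as clues); the sketch only shows how the masking itself might be applied.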