BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

15 Feb 2022 | Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi
BLIP is a new vision-language pre-training (VLP) framework that achieves state-of-the-art performance on a wide range of vision-language tasks, including image-text retrieval, image captioning, and visual question answering. It addresses the limitation of existing models that excel at either understanding or generation tasks, but not both, and it improves performance by bootstrapping noisy web data with a captioner and a filter: the captioner generates synthetic captions, while the filter removes noisy ones, leading to better performance on downstream tasks.

The model is built on a multimodal mixture of encoder-decoder (MED) architecture, which can operate as a unimodal encoder, an image-grounded text encoder, or an image-grounded text decoder. It is pre-trained with three objectives: image-text contrastive learning, image-text matching, and language modeling.

BLIP achieves significant improvements across tasks, including a +2.7% increase in average recall@1 for image-text retrieval, a +2.8% increase in CIDEr for image captioning, and a +1.6% increase in VQA score. It also generalizes strongly to video-language tasks when transferred directly in a zero-shot manner, without additional training. The model and datasets are released for further research.
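To make the captioner-and-filter bootstrapping (CapFilt) concrete, below is a minimal sketch of the dataset-cleaning loop it describes: each web image-text pair is kept if the filter judges the web caption to match the image, and a synthetic caption from the captioner is added if the filter accepts it. The function names (`generate_caption`, `is_matched`, `bootstrap_dataset`) and the stub logic are assumptions for illustration only, not the authors' released code; in BLIP, the captioner is an image-grounded text decoder and the filter is an image-grounded text encoder, both fine-tuned from the pre-trained MED.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ImageTextPair:
    image_id: str
    caption: str


# Hypothetical stand-ins for BLIP's captioner and filter.
# A real implementation would wrap the fine-tuned MED model.
def generate_caption(image_id: str) -> str:
    return f"a synthetic caption for {image_id}"


def is_matched(image_id: str, caption: str) -> bool:
    # Placeholder for the image-text matching head's accept/reject decision.
    return len(caption) > 0


def bootstrap_dataset(
    web_pairs: List[ImageTextPair],
    captioner: Callable[[str], str] = generate_caption,
    filter_fn: Callable[[str, str], bool] = is_matched,
) -> List[ImageTextPair]:
    """CapFilt-style bootstrapping: keep a caption (web or synthetic)
    only if the filter judges it to match its image."""
    clean: List[ImageTextPair] = []
    for pair in web_pairs:
        # Keep the original web caption if the filter accepts it.
        if filter_fn(pair.image_id, pair.caption):
            clean.append(pair)
        # Add a synthetic caption if the filter accepts it.
        synthetic = captioner(pair.image_id)
        if filter_fn(pair.image_id, synthetic):
            clean.append(ImageTextPair(pair.image_id, synthetic))
    return clean


if __name__ == "__main__":
    noisy = [ImageTextPair("img_001", "random alt-text scraped from the web")]
    print(bootstrap_dataset(noisy))
```

The resulting cleaned pairs, together with human-annotated data, are what BLIP uses to pre-train a new model, which is what drives the downstream gains reported above.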