BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

15 Feb 2022 | Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi
BLIP (Bootstrapping Language-Image Pre-training) is a vision-language pre-training (VLP) framework designed to excel at both understanding-based and generation-based tasks. It addresses two limitations of existing VLP models: most architectures perform well on either understanding or generation tasks but not both, and their performance gains rely heavily on scaling up noisy web-crawled image-text pairs, which is a suboptimal source of supervision.

BLIP makes two key contributions: a Multimodal Mixture of Encoder-Decoder (MED) architecture and a Captioning and Filtering (CapFilt) method for dataset bootstrapping. MED is a flexible model that can operate as a unimodal encoder, an image-grounded text encoder, or an image-grounded text decoder, enabling transfer to a wide range of downstream tasks. It is jointly pre-trained with three objectives: image-text contrastive learning (ITC), image-text matching (ITM), and image-conditioned language modeling (LM). CapFilt uses a captioner to generate synthetic captions for web images and a filter to remove noisy captions, improving the quality of the training corpus.

BLIP achieves state-of-the-art performance on a broad range of vision-language tasks, including image-text retrieval, image captioning, visual question answering, visual reasoning, and visual dialog, and it generalizes strongly to video-language tasks in a zero-shot transfer setting. The paper presents extensive experiments and analysis demonstrating the effectiveness of CapFilt and the benefit of diverse synthetic captions. The code, models, and bootstrapped datasets are released to facilitate further research in vision-language understanding and generation.
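To make the two contributions more concrete, below is a minimal PyTorch-style sketch of the three MED pre-training objectives and the CapFilt bootstrapping loop. The module and method names (`image_encoder`, `image_grounded_text_encoder`, `itm_head`, `captioner.generate`, `filter_model.is_matched`, etc.) are illustrative assumptions, not the released BLIP API, and details such as ITM hard-negative mining and distributed feature gathering for ITC are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def med_pretraining_losses(model, images, text_tokens):
    """Sum of BLIP's three pre-training objectives on a batch of image-text pairs.

    `model` is assumed to expose the MED components described in the paper:
    a ViT image encoder, a unimodal text encoder, an image-grounded text
    encoder with cross-attention, and an image-grounded text decoder.
    """
    image_embeds = model.image_encoder(images)                      # (B, N, D)

    # 1) Image-Text Contrastive (ITC): align unimodal image and text features.
    img_feat = F.normalize(model.vision_proj(image_embeds[:, 0]), dim=-1)
    txt_feat = F.normalize(model.text_proj(model.text_encoder(text_tokens)[:, 0]), dim=-1)
    sim = img_feat @ txt_feat.t() / model.temperature               # (B, B)
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_itc = 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))

    # 2) Image-Text Matching (ITM): binary classification on the multimodal
    #    representation from the image-grounded text encoder.
    #    (Hard negatives mined from ITC similarities are omitted here.)
    multimodal = model.image_grounded_text_encoder(text_tokens, image_embeds)
    itm_logits = model.itm_head(multimodal[:, 0])                   # (B, 2)
    itm_labels = torch.ones(itm_logits.size(0), dtype=torch.long, device=itm_logits.device)
    loss_itm = F.cross_entropy(itm_logits, itm_labels)

    # 3) Language Modeling (LM): autoregressively generate the caption
    #    conditioned on the image, via the image-grounded text decoder.
    loss_lm = model.image_grounded_text_decoder(text_tokens, image_embeds, labels=text_tokens)

    return loss_itc + loss_itm + loss_lm


def capfilt(captioner, filter_model, web_pairs, human_pairs):
    """CapFilt sketch: bootstrap a cleaner dataset from noisy web pairs.

    The captioner (MED fine-tuned with the LM objective) writes a synthetic
    caption for each web image; the filter (MED fine-tuned with ITC/ITM)
    keeps only texts it judges to match the image. `generate` and
    `is_matched` are hypothetical method names.
    """
    bootstrapped = list(human_pairs)           # human-annotated pairs are kept as-is
    for image, web_text in web_pairs:
        synthetic_text = captioner.generate(image)
        for text in (web_text, synthetic_text):
            if filter_model.is_matched(image, text):
                bootstrapped.append((image, text))
    return bootstrapped                        # used to pre-train a new model
```

In the paper, the text encoder and text decoder share all parameters except their self-attention layers, so a single MED model serves all three roles during pre-training.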