15 Jun 2023 | Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
BLIP-2 is a vision-language pre-training method that leverages frozen pre-trained image encoders and large language models (LLMs) to achieve state-of-the-art performance with significantly fewer trainable parameters. The method uses a lightweight Querying Transformer (Q-Former) pre-trained in two stages: first, to learn visual representations relevant to the accompanying text, and second, to enable vision-to-language generative learning. The Q-Former bridges the modality gap between the frozen image encoder and the frozen LLM, enabling zero-shot image-to-text generation and other vision-language tasks. BLIP-2 outperforms existing methods such as Flamingo80B on zero-shot VQAv2 while using 54x fewer trainable parameters.

Because both unimodal backbones stay frozen, BLIP-2 is computationally efficient, and it can be prompted to perform zero-shot image-to-text generation that follows natural language instructions, demonstrating emerging capabilities in visual knowledge reasoning and visual conversation. The method is generic and can benefit from more advanced unimodal models as they become available. BLIP-2 achieves strong performance on a range of vision-language tasks, including image captioning, visual question answering, and image-text retrieval; in zero-shot image-text retrieval, the image-text contrastive (ITC) and image-text matching (ITM) losses play a crucial role.

However, BLIP-2 has limitations, such as limited in-context learning ability and the risks inherited from frozen models, including outputting offensive language or leaking private information. The method is nonetheless a significant step towards building a multimodal conversational AI agent.
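To make the bridging idea concrete, here is a minimal, illustrative sketch (not the paper's implementation) of how a small set of learned queries could cross-attend to frozen image features and then be projected into an LLM's embedding space as soft visual prompts. The layer sizes (32 queries, 768-d queries, 1408-d image features, 2560-d LLM embeddings) are assumptions loosely modeled on a ViT-g plus OPT-2.7B setup; the actual Q-Former is a BERT-style transformer with self- and cross-attention blocks, trained with the stage-1 objectives (ITC, ITM, and image-grounded text generation).

```python
import torch
import torch.nn as nn

class QFormerBridgeSketch(nn.Module):
    """Illustrative sketch only: learned query tokens cross-attend to frozen
    image features, then a linear projection maps them into the LLM embedding
    space as soft visual prompts. The real Q-Former is a full BERT-style
    transformer; dimensions here are assumed, not taken from the paper's code."""

    def __init__(self, num_queries=32, q_dim=768, img_dim=1408, llm_dim=2560):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, q_dim))  # learned query tokens
        self.cross_attn = nn.MultiheadAttention(q_dim, num_heads=12,
                                                kdim=img_dim, vdim=img_dim,
                                                batch_first=True)
        self.proj_to_llm = nn.Linear(q_dim, llm_dim)  # stage-2 style projection into the LLM space

    def forward(self, image_feats):
        # image_feats: (B, num_patches, img_dim) produced by the frozen image encoder
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        q, _ = self.cross_attn(q, image_feats, image_feats)
        # (B, num_queries, llm_dim): these would be prepended to the frozen LLM's input embeddings
        return self.proj_to_llm(q)
```

For the prompted zero-shot image-to-text generation described above, a short usage sketch follows. It assumes the Hugging Face transformers port of BLIP-2 (Blip2Processor, Blip2ForConditionalGeneration) and the publicly released Salesforce/blip2-opt-2.7b checkpoint; the image URL and the question prompt are placeholders, not from the paper.

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and a BLIP-2 checkpoint (frozen ViT + Q-Former + frozen OPT-2.7B).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

# Example image and an instruction-style prompt for zero-shot VQA.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "Question: how many animals are in the picture? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, model.dtype)
generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```

Changing the prompt (e.g., to a caption request or a follow-up question) is what enables the instruction-following, zero-shot behavior the summary describes, since the frozen LLM simply continues the text conditioned on the Q-Former's visual prompts.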