15 Jun 2023 | Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi
BLIP-2 is a vision-language pre-training method that leverages frozen pre-trained image encoders and large language models (LLMs) to achieve state-of-the-art performance with significantly fewer trainable parameters. The method uses a lightweight Querying Transformer (Q-Former) pre-trained in two stages: first, to learn visual representations relevant to the accompanying text, and second, to enable vision-to-language generative learning. The Q-Former bridges the modality gap between the frozen image encoder and the frozen LLM, enabling zero-shot image-to-text generation and other vision-language tasks. BLIP-2 outperforms existing methods such as Flamingo80B on zero-shot VQAv2 while using 54x fewer trainable parameters.

Because both unimodal backbones stay frozen, BLIP-2 is computationally efficient, and it can be prompted to perform zero-shot image-to-text generation that follows natural language instructions, demonstrating emerging capabilities in visual knowledge reasoning and visual conversation. The method is generic and can benefit from more advanced unimodal models as they become available. BLIP-2 achieves strong performance on a range of vision-language tasks, including image captioning, visual question answering, and image-text retrieval; in zero-shot image-text retrieval, the image-text contrastive (ITC) and image-text matching (ITM) losses play a crucial role.

However, BLIP-2 has limitations, such as limited in-context learning ability and the risks inherited from frozen models, including outputting offensive language or leaking private information. The method is nonetheless a significant step towards building a multimodal conversational AI agent.
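To make the bridging idea concrete, here is a minimal, illustrative sketch (not the paper's implementation) of how a small set of learned queries could cross-attend to frozen image features and then be projected into an LLM's embedding space as soft visual prompts. The layer sizes (32 queries, 768-d queries, 1408-d image features, 2560-d LLM embeddings) are assumptions loosely modeled on a ViT-g plus OPT-2.7B setup; the actual Q-Former is a BERT-style transformer with self- and cross-attention blocks, trained with the stage-1 objectives (ITC, ITM, and image-grounded text generation).

```python
import torch
import torch.nn as nn

class QFormerBridgeSketch(nn.Module):
    """Illustrative sketch only: learned query tokens cross-attend to frozen
    image features, then a linear projection maps them into the LLM embedding
    space as soft visual prompts. The real Q-Former is a full BERT-style
    transformer; dimensions here are assumed, not taken from the paper's code."""

    def __init__(self, num_queries=32, q_dim=768, img_dim=1408, llm_dim=2560):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, q_dim))  # learned query tokens
        self.cross_attn = nn.MultiheadAttention(q_dim, num_heads=12,
                                                kdim=img_dim, vdim=img_dim,
                                                batch_first=True)
        self.proj_to_llm = nn.Linear(q_dim, llm_dim)  # stage-2 style projection into the LLM space

    def forward(self, image_feats):
        # image_feats: (B, num_patches, img_dim) produced by the frozen image encoder
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        q, _ = self.cross_attn(q, image_feats, image_feats)
        # (B, num_queries, llm_dim): these would be prepended to the frozen LLM's input embeddings
        return self.proj_to_llm(q)
```

For the prompted zero-shot image-to-text generation described above, a short usage sketch follows. It assumes the Hugging Face transformers port of BLIP-2 (Blip2Processor, Blip2ForConditionalGeneration) and the publicly released Salesforce/blip2-opt-2.7b checkpoint; the image URL and the question prompt are placeholders, not from the paper.

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and a BLIP-2 checkpoint (frozen ViT + Q-Former + frozen OPT-2.7B).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

# Example image and an instruction-style prompt for zero-shot VQA.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "Question: how many animals are in the picture? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, model.dtype)
generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```

Changing the prompt (e.g., to a caption request or a follow-up question) is what enables the instruction-following, zero-shot behavior the summary describes, since the frozen LLM simply continues the text conditioned on the Q-Former's visual prompts.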