InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

15 Jun 2023 | Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi
InstructBLIP is a vision-language instruction tuning framework that enables general-purpose models to solve a wide range of vision-language tasks through a unified natural language interface. The paper presents a comprehensive, systematic study of vision-language instruction tuning built on the pre-trained BLIP-2 models. The researchers gathered 26 publicly available datasets covering a wide variety of tasks and capabilities, transformed them into instruction tuning format, and grouped them into 11 task categories. Thirteen held-in datasets are used for instruction tuning and 13 held-out datasets for zero-shot evaluation; four entire task categories are additionally withheld so that generalization can also be assessed at the task level. Extensive quantitative and qualitative results demonstrate the effectiveness of InstructBLIP on vision-language zero-shot generalization.

On the modeling side, the paper proposes instruction-aware visual feature extraction through an instruction-aware Query Transformer (Q-Former), a mechanism that extracts flexible, informative features tailored to the given instruction. The textual instruction is fed not only to the frozen LLM but also to the Q-Former, so that the Q-Former can extract instruction-aware visual features from the output of the frozen image encoder (a minimal sketch of this idea follows this summary). In addition, the researchers propose a balanced sampling strategy that synchronizes learning progress across datasets of very different sizes (also sketched below). The paper evaluates and open-sources a suite of InstructBLIP models built on two families of LLMs: FlanT5, an encoder-decoder LLM fine-tuned from T5, and Vicuna, a decoder-only LLM fine-tuned from LLaMA.

Trained on the 13 held-in datasets, InstructBLIP achieves state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo models. A comparison with multitask learning shows that instruction tuning yields significant improvements on unseen held-out datasets. When used as the initialization for fine-tuning on individual downstream tasks, InstructBLIP provides a better starting point than most previous methods and achieves state-of-the-art results on three out of four datasets. The paper also demonstrates the advantages of InstructBLIP over concurrent multimodal models, discusses related work on instruction tuning, multitask learning, and vision-language instruction tuning, and concludes that InstructBLIP is a simple yet novel instruction tuning framework towards generalized vision-language models. All InstructBLIP models are open-sourced.
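The instruction-aware Q-Former is the core architectural change over BLIP-2. The sketch below is a minimal, PyTorch-style rendering of the data flow described above: learnable queries and instruction tokens share self-attention, the queries then cross-attend to the frozen image encoder's features, and the result is projected into the LLM's embedding space. All class and variable names, dimensions, and the single-block structure are illustrative assumptions, not the authors' implementation (the actual Q-Former is a multi-layer BERT-style transformer initialized from BLIP-2).

```python
# Minimal sketch of instruction-aware visual feature extraction (InstructBLIP-style).
# Names, sizes, and the single attention block are placeholders for illustration only.
import torch
import torch.nn as nn


class InstructionAwareQFormerSketch(nn.Module):
    def __init__(self, dim=768, num_queries=32, llm_dim=4096, vocab_size=30522):
        super().__init__()
        # A fixed set of learnable query embeddings (32 in BLIP-2 / InstructBLIP).
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.instr_embed = nn.Embedding(vocab_size, dim)  # instruction token embeddings
        # Self-attention shared by queries and instruction tokens,
        # then cross-attention from the queries to the frozen image features.
        self.self_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Linear projection into the frozen LLM's embedding space.
        self.llm_proj = nn.Linear(dim, llm_dim)

    def forward(self, image_feats, instr_token_ids):
        """image_feats: (B, N_patches, dim) from a frozen image encoder.
        instr_token_ids: (B, L) tokenized instruction text."""
        B = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)           # (B, 32, dim)
        t = self.instr_embed(instr_token_ids)                     # (B, L, dim)
        # Queries and instruction tokens interact in shared self-attention;
        # this is what makes the extracted visual features "instruction-aware".
        x = torch.cat([q, t], dim=1)
        x = x + self.self_attn(x, x, x, need_weights=False)[0]
        q = x[:, : q.size(1)]                                     # keep only the query slots
        # Instruction-conditioned queries cross-attend to the image features.
        q = q + self.cross_attn(q, image_feats, image_feats, need_weights=False)[0]
        q = q + self.ffn(q)
        # Project to the LLM space; these vectors are prepended to the LLM's text input.
        return self.llm_proj(q)                                   # (B, 32, llm_dim)


if __name__ == "__main__":
    sketch = InstructionAwareQFormerSketch()
    img = torch.randn(2, 257, 768)               # e.g. frozen ViT patch features
    instr = torch.randint(0, 30522, (2, 16))     # tokenized instruction
    print(sketch(img, instr).shape)              # torch.Size([2, 32, 4096])
```

During training, only the Q-Former and the projection are updated; the image encoder and the LLM stay frozen, which is what keeps instruction tuning lightweight.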
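The balanced sampling strategy is described in the paper as drawing training examples from dataset d with probability proportional to the square root of its size, p_d = sqrt(S_d) / sum_j sqrt(S_j). The snippet below illustrates that rule; the dataset names and sizes are made up for the example.

```python
# Illustration of square-root-proportional dataset sampling.
# The dataset names and sizes below are hypothetical, not the paper's actual mixture.
import math
import random

dataset_sizes = {
    "vqa_like_dataset": 400_000,
    "captioning_dataset": 100_000,
    "small_reasoning_dataset": 10_000,
}

# p_d = sqrt(S_d) / sum_j sqrt(S_j)
weights = {name: math.sqrt(size) for name, size in dataset_sizes.items()}
total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}
print(probs)  # roughly 0.60 / 0.30 / 0.10


def sample_dataset(rng=random):
    """Pick which dataset the next training batch is drawn from."""
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


for _ in range(5):
    print(sample_dataset())
```

Square-root weighting damps the dominance of the largest datasets while still sampling them more often than uniform mixing would, which is what keeps learning progress roughly synchronized across datasets of very different sizes.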