InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
**Authors:** Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi
**Affiliations:** Salesforce Research; Hong Kong University of Science and Technology; Nanyang Technological University, Singapore
**Repository:** https://github.com/salesforce/LAVIS/tree/main/projects/instructblip
**Abstract:**
This paper presents InstructBLIP, a vision-language instruction tuning framework that enables general-purpose models to solve a wide range of vision-language tasks through a unified natural language interface. The authors conduct a systematic and comprehensive study of vision-language instruction tuning, gathering 26 publicly available datasets and transforming them into an instruction-tuning format. They introduce an instruction-aware Query Transformer (Q-Former) that extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP achieves state-of-the-art zero-shot performance across all 13 held-out datasets, outperforming BLIP-2 and the larger Flamingo models. The models also achieve state-of-the-art performance when fine-tuned on individual downstream tasks. Qualitative examples demonstrate the advantages of InstructBLIP over concurrent multimodal models.
**Key Contributions:**
1. Comprehensive and systematic study on vision-language instruction tuning.
2. Introduction of an instruction-aware visual feature extraction mechanism.
3. Evaluation and open-sourcing of InstructBLIP models using FlanT5 and Vicuna LLMs.
**Methods:**
- **Data Collection:** 26 datasets covering 11 task categories.
- **Training and Evaluation:** 13 held-in datasets for training and 13 held-out datasets for zero-shot evaluation.
- **Instruction-aware Visual Feature Extraction:** The instruction text is fed into the Q-Former together with its learnable query tokens, so the extracted image features are tailored to the given instruction (see the sketch after this list).
- **Balanced Data Sampling:** Datasets are sampled with probability proportional to the square root of their sizes, so the model neither overfits the small datasets nor underfits the large ones (illustrated below).
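To make the instruction-aware extraction concrete, here is a minimal PyTorch sketch rather than the released LAVIS implementation: the class name `InstructionAwareQFormer`, the generic `nn.TransformerEncoder` standing in for the BERT-based Q-Former, and the tensor shapes are illustrative assumptions. What it shows is the key idea: the learnable query tokens and the instruction tokens share self-attention, so the queries that later cross-attend to the frozen image features are conditioned on the instruction.

```python
import torch
import torch.nn as nn

class InstructionAwareQFormer(nn.Module):
    """Illustrative stand-in for the Q-Former: queries and instruction tokens
    interact via self-attention; queries then cross-attend to image features."""

    def __init__(self, hidden=768, num_queries=32, nhead=12, layers=2):
        super().__init__()
        self.query_tokens = nn.Parameter(torch.randn(num_queries, hidden))
        enc_layer = nn.TransformerEncoderLayer(hidden, nhead, batch_first=True)
        self.self_attn_stack = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.cross_attn = nn.MultiheadAttention(hidden, nhead, batch_first=True)

    def forward(self, image_feats, instruction_embeds):
        # image_feats: (B, N_img, hidden) from a frozen image encoder
        # instruction_embeds: (B, N_txt, hidden) embedded instruction tokens
        b = image_feats.size(0)
        queries = self.query_tokens.unsqueeze(0).expand(b, -1, -1)
        # Self-attention over [queries; instruction] conditions the queries
        # on the instruction before any visual information is read.
        fused = self.self_attn_stack(torch.cat([queries, instruction_embeds], dim=1))
        queries = fused[:, : self.query_tokens.size(0)]
        # Conditioned queries extract instruction-relevant visual features.
        visual_tokens, _ = self.cross_attn(queries, image_feats, image_feats)
        return visual_tokens  # projected and prepended to the frozen LLM's input

# Usage with dummy tensors
qformer = InstructionAwareQFormer()
img = torch.randn(2, 257, 768)      # e.g. ViT patch features
instr = torch.randn(2, 16, 768)     # embedded instruction text
print(qformer(img, instr).shape)    # torch.Size([2, 32, 768])
```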
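The balanced sampling rule is equally compact. The sketch below uses made-up dataset names and sizes; it only shows how the square-root-of-size rule turns into per-dataset sampling probabilities (the paper additionally makes small manual adjustments for certain datasets).

```python
import math
import random

def sqrt_sampling_probs(sizes):
    """Probability of drawing from dataset d is sqrt(S_d) / sum_i sqrt(S_i)."""
    roots = [math.sqrt(s) for s in sizes]
    total = sum(roots)
    return [r / total for r in roots]

# Hypothetical held-in dataset sizes (number of training samples).
sizes = {"captioning": 500_000, "vqa": 400_000, "visual_dialogue": 20_000}
probs = sqrt_sampling_probs(list(sizes.values()))

# Each training step, first pick a dataset, then draw a sample from it.
dataset_name = random.choices(list(sizes.keys()), weights=probs, k=1)[0]
```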
**Results:**
- InstructBLIP achieves state-of-the-art zero-shot performance across the 13 held-out datasets.
- It outperforms both BLIP-2 and the substantially larger Flamingo models on these held-out evaluations.
- When fine-tuned on individual downstream tasks such as ScienceQA with image context, InstructBLIP also reaches state-of-the-art accuracy.
**Qualitative Examples:**
- InstructBLIP demonstrates complex visual reasoning, knowledge-grounded image description, and multi-turn conversations.
**Conclusion:**
InstructBLIP is a novel instruction tuning framework that enhances the generalization ability of vision-language models to unseen tasks. The paper provides a comprehensive analysis and validation of its effectiveness, aiming to spur new research in general-purpose multimodal AI.