InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

15 Jun 2023 | Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi
InstructBLIP is a vision-language instruction tuning framework that enables general-purpose models to solve a wide range of vision-language tasks through a unified natural language interface. The paper presents a comprehensive, systematic study of vision-language instruction tuning built on the pre-trained BLIP-2 models. The researchers gathered 26 publicly available datasets covering a wide variety of tasks and capabilities, transformed them into instruction tuning format, and grouped them into 11 task categories. Thirteen held-in datasets are used for instruction tuning and 13 held-out datasets for zero-shot evaluation; four entire task categories are additionally withheld so that generalization can also be assessed at the task level. Extensive quantitative and qualitative results demonstrate the effectiveness of InstructBLIP on vision-language zero-shot generalization.

On the modeling side, the paper proposes instruction-aware visual feature extraction through an instruction-aware Query Transformer (Q-Former), a mechanism that extracts flexible, informative features tailored to the given instruction. The textual instruction is fed not only to the frozen LLM but also to the Q-Former, so that the Q-Former can extract instruction-aware visual features from the output of the frozen image encoder (a minimal sketch of this idea follows this summary). In addition, the researchers propose a balanced sampling strategy that synchronizes learning progress across datasets of very different sizes (also sketched below). The paper evaluates and open-sources a suite of InstructBLIP models built on two families of LLMs: FlanT5, an encoder-decoder LLM fine-tuned from T5, and Vicuna, a decoder-only LLM fine-tuned from LLaMA.

Trained on the 13 held-in datasets, InstructBLIP achieves state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo models. A comparison with multitask learning shows that instruction tuning yields significant improvements on unseen held-out datasets. When used as the initialization for fine-tuning on individual downstream tasks, InstructBLIP provides a better starting point than most previous methods and achieves state-of-the-art results on three out of four datasets. The paper also demonstrates the advantages of InstructBLIP over concurrent multimodal models, discusses related work on instruction tuning, multitask learning, and vision-language instruction tuning, and concludes that InstructBLIP is a simple yet novel instruction tuning framework towards generalized vision-language models. All InstructBLIP models are open-sourced.
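The instruction-aware Q-Former is the core architectural change over BLIP-2. The sketch below is a minimal, PyTorch-style rendering of the data flow described above: learnable queries and instruction tokens share self-attention, the queries then cross-attend to the frozen image encoder's features, and the result is projected into the LLM's embedding space. All class and variable names, dimensions, and the single-block structure are illustrative assumptions, not the authors' implementation (the actual Q-Former is a multi-layer BERT-style transformer initialized from BLIP-2).

```python
# Minimal sketch of instruction-aware visual feature extraction (InstructBLIP-style).
# Names, sizes, and the single attention block are placeholders for illustration only.
import torch
import torch.nn as nn


class InstructionAwareQFormerSketch(nn.Module):
    def __init__(self, dim=768, num_queries=32, llm_dim=4096, vocab_size=30522):
        super().__init__()
        # A fixed set of learnable query embeddings (32 in BLIP-2 / InstructBLIP).
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.instr_embed = nn.Embedding(vocab_size, dim)  # instruction token embeddings
        # Self-attention shared by queries and instruction tokens,
        # then cross-attention from the queries to the frozen image features.
        self.self_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Linear projection into the frozen LLM's embedding space.
        self.llm_proj = nn.Linear(dim, llm_dim)

    def forward(self, image_feats, instr_token_ids):
        """image_feats: (B, N_patches, dim) from a frozen image encoder.
        instr_token_ids: (B, L) tokenized instruction text."""
        B = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)           # (B, 32, dim)
        t = self.instr_embed(instr_token_ids)                     # (B, L, dim)
        # Queries and instruction tokens interact in shared self-attention;
        # this is what makes the extracted visual features "instruction-aware".
        x = torch.cat([q, t], dim=1)
        x = x + self.self_attn(x, x, x, need_weights=False)[0]
        q = x[:, : q.size(1)]                                     # keep only the query slots
        # Instruction-conditioned queries cross-attend to the image features.
        q = q + self.cross_attn(q, image_feats, image_feats, need_weights=False)[0]
        q = q + self.ffn(q)
        # Project to the LLM space; these vectors are prepended to the LLM's text input.
        return self.llm_proj(q)                                   # (B, 32, llm_dim)


if __name__ == "__main__":
    sketch = InstructionAwareQFormerSketch()
    img = torch.randn(2, 257, 768)               # e.g. frozen ViT patch features
    instr = torch.randint(0, 30522, (2, 16))     # tokenized instruction
    print(sketch(img, instr).shape)              # torch.Size([2, 32, 4096])
```

During training, only the Q-Former and the projection are updated; the image encoder and the LLM stay frozen, which is what keeps instruction tuning lightweight.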
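The balanced sampling strategy is described in the paper as drawing training examples from dataset d with probability proportional to the square root of its size, p_d = sqrt(S_d) / sum_j sqrt(S_j). The snippet below illustrates that rule; the dataset names and sizes are made up for the example.

```python
# Illustration of square-root-proportional dataset sampling.
# The dataset names and sizes below are hypothetical, not the paper's actual mixture.
import math
import random

dataset_sizes = {
    "vqa_like_dataset": 400_000,
    "captioning_dataset": 100_000,
    "small_reasoning_dataset": 10_000,
}

# p_d = sqrt(S_d) / sum_j sqrt(S_j)
weights = {name: math.sqrt(size) for name, size in dataset_sizes.items()}
total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}
print(probs)  # roughly 0.60 / 0.30 / 0.10


def sample_dataset(rng=random):
    """Pick which dataset the next training batch is drawn from."""
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


for _ in range(5):
    print(sample_dataset())
```

Square-root weighting damps the dominance of the largest datasets while still sampling them more often than uniform mixing would, which is what keeps learning progress roughly synchronized across datasets of very different sizes.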