The paper "VISION-FLAN: Scaling Human-Labeled Tasks in Visual Instruction Tuning" addresses two significant challenges in vision-language models (VLMs): the lack of task diversity in pretraining and visual instruction tuning, and annotation errors and bias in GPT-4 synthesized instruction tuning data. To tackle these issues, the authors introduce VISION-FLAN, a diverse public dataset comprising 187 tasks and 1,664,261 instances from academic datasets, each accompanied by expert-written instructions. They propose a two-stage instruction tuning framework where VLMs are first fine-tuned on VISION-FLAN and then further tuned on GPT-4 synthesized data. This approach significantly outperforms traditional single-stage visual instruction tuning frameworks and achieves state-of-the-art performance on various multi-modal evaluation benchmarks. The study also reveals that GPT-4 synthesized data does not substantially enhance VLMs' capabilities but modulates responses to human-preferred formats, and minimal GPT-4 synthesized data (e.g., 1,000 instances) can effectively align VLMs' responses with human preferences. Additionally, visual instruction tuning primarily helps large-language models (LLMs) understand visual features. The paper includes detailed analyses and insights into the contributions of human-labeled and GPT-4 synthesized data, as well as the impact of different training strategies.The paper "VISION-FLAN: Scaling Human-Labeled Tasks in Visual Instruction Tuning" addresses two significant challenges in vision-language models (VLMs): the lack of task diversity in pretraining and visual instruction tuning, and annotation errors and bias in GPT-4 synthesized instruction tuning data. To tackle these issues, the authors introduce VISION-FLAN, a diverse public dataset comprising 187 tasks and 1,664,261 instances from academic datasets, each accompanied by expert-written instructions. They propose a two-stage instruction tuning framework where VLMs are first fine-tuned on VISION-FLAN and then further tuned on GPT-4 synthesized data. This approach significantly outperforms traditional single-stage visual instruction tuning frameworks and achieves state-of-the-art performance on various multi-modal evaluation benchmarks. The study also reveals that GPT-4 synthesized data does not substantially enhance VLMs' capabilities but modulates responses to human-preferred formats, and minimal GPT-4 synthesized data (e.g., 1,000 instances) can effectively align VLMs' responses with human preferences. Additionally, visual instruction tuning primarily helps large-language models (LLMs) understand visual features. The paper includes detailed analyses and insights into the contributions of human-labeled and GPT-4 synthesized data, as well as the impact of different training strategies.