VISION-FLAN is a diverse visual instruction tuning dataset containing 187 tasks and 1,664,261 instances, designed to address two challenges in vision-language models (VLMs): the limited task diversity of existing pre-training and visual instruction tuning data, and the annotation errors and biases in GPT-4 synthesized data. Each task is sourced from an academic dataset and paired with an expert-written instruction. The paper proposes a two-stage instruction tuning framework in which VLMs are first fine-tuned on VISION-FLAN and then further tuned on GPT-4 synthesized data. This approach significantly outperforms conventional single-stage tuning and achieves state-of-the-art performance on multiple benchmarks.

Analysis reveals that GPT-4 synthesized data does not substantially enhance VLM capabilities; instead, it mainly modulates responses toward human-preferred formats, and a minimal amount of such data (e.g., 1,000 instances) is enough to align VLM responses with human preferences. Visual instruction tuning chiefly helps the underlying large language model (LLM) understand visual features. VISION-FLAN BASE, the model produced by the first stage, performs well on comprehensive benchmarks but struggles with tasks requiring longer responses; VISION-FLAN CHAT, obtained by further tuning on a small amount of GPT-4 data, achieves better performance on LLaVA-Bench.
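The two-stage recipe is straightforward to express in code. The sketch below is a minimal PyTorch illustration of the idea, not the authors' released training script; the model interface, learning rates, and epoch counts are assumptions chosen for clarity.

```python
import torch
from torch.utils.data import DataLoader

def run_stage(model, loader: DataLoader, lr: float, epochs: int) -> None:
    """Generic supervised fine-tuning loop (next-token cross-entropy)."""
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            # Assumes the model returns an object with a .loss field when given
            # images and token ids, as in typical LLaVA-style implementations.
            loss = model(images=batch["images"],
                         input_ids=batch["input_ids"],
                         labels=batch["labels"]).loss
            optim.zero_grad()
            loss.backward()
            optim.step()

def two_stage_tune(model, vision_flan_loader: DataLoader, gpt4_loader: DataLoader):
    # Stage 1: fine-tune on the large, diverse, human-labeled VISION-FLAN tasks
    # to build core visual capabilities ("VISION-FLAN BASE").
    run_stage(model, vision_flan_loader, lr=2e-5, epochs=1)
    # Stage 2: briefly tune on a small amount of GPT-4 synthesized data
    # (on the order of 1,000 instances) to align response style ("VISION-FLAN CHAT").
    run_stage(model, gpt4_loader, lr=1e-5, epochs=1)
    return model
```

The key design point reflected here is the asymmetry between the stages: the first stage does the heavy lifting on capability, while the second stage is a short, lightweight pass whose only job is format alignment.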
The two-stage framework also reduces hallucination and catastrophic forgetting, and it aligns VLM responses with human preferences more effectively than single-stage tuning. The study highlights the importance of diverse human-labeled tasks for improving VLM capabilities, whereas GPT-4 synthesized data has only a limited impact on core performance. The results further indicate that visual instruction tuning primarily enhances the LLM's ability to process visual features, with the bridging module playing a key role in mapping visual features into the LLM's embedding space. A comparison with existing datasets shows VISION-FLAN's superior task diversity and coverage. The study concludes that VISION-FLAN is a valuable resource for improving VLMs through diverse human-labeled tasks, and that further research is needed to address limitations such as language bias and the restriction to single-image tasks.
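For concreteness, the following is a minimal sketch of the kind of bridging module referred to above: a small projector that maps frozen vision-encoder features into the LLM's embedding space. The two-layer MLP design and the dimensions are assumptions in the style of LLaVA-like models, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM's embedding space.

    Hypothetical dimensions: 1024-d vision features (e.g., a CLIP ViT)
    projected to a 4096-d LLM hidden size.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # Returns visual "tokens" that are concatenated with text embeddings
        # before being fed to the LLM.
        return self.proj(patch_features)
```

Under this view, visual instruction tuning teaches the LLM (and this projector) to treat the projected patch features as just another sequence of input embeddings, which is consistent with the finding that the tuning mainly improves how the LLM consumes visual features rather than adding new knowledge.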