MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

7 Aug 2024 | Yangzhou LIU†‡, Yue CAO†‡, Zhangwei GAO†‡, Weiyun WANG†‡, Zhe CHEN†‡, Wenhai WANG†‡, Hao TIAN‡, Lewei LU‡, Xizhou ZHU†‡, Tong LU‡, Yu QIAO† & Jifeng DAI†*
The paper introduces MMINSTRUCT, a high-quality and diverse visual instruction tuning dataset designed to enhance the performance of Vision Large Language Models (VLLMs). Existing visual instruction tuning datasets suffer from limitations such as poor instruction annotation quality, limited instruction types, and lack of image diversity. To address these issues, MMINSTRUCT includes 973K instructions from 24 domains, featuring four instruction types: Judgment, Multiple-Choice, Long Visual Question Answering, and Short Visual Question Answering. The dataset is constructed using a semi-automatic, low-cost instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction. This approach ensures high-quality annotations and diverse instruction types at a fraction of the cost of manual construction. Experimental results show that MMINSTRUCT significantly improves the performance of VLLMs on various benchmarks, achieving state-of-the-art results on 10 out of 12 benchmarks. The paper also includes ablation studies to validate the effectiveness of domain and question type diversity in MMINSTRUCT.
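The summary describes a semi-automatic data engine that combines GPT-4V, GPT-3.5, and manual correction. Below is a minimal sketch of what such a pipeline could look like, assuming GPT-4V (approximated here with a vision-capable chat model) produces a detailed caption and GPT-3.5 expands it into the four instruction types; the role split, prompts, model names, and function names are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of a semi-automatic instruction-generation loop in the spirit
# of the MMINSTRUCT data engine. All prompts and helper names are hypothetical.
import base64
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTION_TYPES = ["judgment", "multiple-choice", "long VQA", "short VQA"]


def caption_image(image_path: str) -> str:
    """Ask a vision-capable model (stand-in for GPT-4V) for a detailed caption."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content


def generate_instructions(caption: str) -> list[dict]:
    """Expand a caption into the four instruction types with a text-only model."""
    prompt = (
        "Given the image description below, write one question-answer pair for "
        f"each of these types: {', '.join(QUESTION_TYPES)}. "
        "Return a JSON list of {type, question, answer} objects.\n\n" + caption
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    # In practice the model output may need validation before json.loads succeeds.
    return json.loads(resp.choices[0].message.content)


if __name__ == "__main__":
    # Drafts produced this way would still pass through the manual-correction
    # stage described in the paper before being added to the dataset.
    caption = caption_image("example.jpg")
    for item in generate_instructions(caption):
        print(item["type"], "->", item["question"])
```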