MMInstruct is a high-quality, diverse multi-modal instruction tuning dataset containing 973,000 instructions across 24 domains. It covers four instruction types: Judgment, Multiple-Choice, Long Visual Question Answering (LVQA), and Short Visual Question Answering (SVQA). To construct MMInstruct, the authors propose a semi-automatic, low-cost instruction generation data engine that combines GPT-4V, GPT-3.5, and manual correction, producing multi-domain instructions at roughly one-sixth the cost of fully manual construction.

The paper identifies limited image diversity and annotation quality as key weaknesses of existing visual instruction tuning datasets and presents MMInstruct as a response: its images and instruction formats span a broad range of types, ensuring coverage and diversity. In extensive experiments, fine-tuning Vision Large Language Models (VLLMs) on MMInstruct achieves state-of-the-art performance on 10 of 12 benchmarks, with significant improvements over existing datasets. The authors conclude that MMInstruct is a valuable resource for improving VLLM performance through instruction tuning. The code and data are available at https://github.com/yuecao0119/MMINstruct.
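
To make the semi-automatic data engine described above more concrete, the minimal sketch below outlines one plausible shape of its stages: GPT-4V produces a detailed image caption, GPT-3.5 expands the caption into question-answer pairs for each instruction type, and every generated record is flagged for manual correction. The record fields, helper functions, and prompts here are illustrative assumptions, not the authors' released code or the actual MMInstruct schema.

```python
"""Illustrative sketch of a GPT-4V / GPT-3.5 / manual-correction data engine.

All names (fields, helper functions, prompts) are hypothetical; they only mirror
the three-stage pipeline described in the paper, not its real implementation.
"""
from dataclasses import dataclass
from typing import List, Tuple

INSTRUCTION_TYPES = ["judgment", "multiple_choice", "lvqa", "svqa"]

@dataclass
class InstructionRecord:
    image_path: str
    domain: str                  # one of the 24 domains
    instruction_type: str        # judgment / multiple_choice / lvqa / svqa
    question: str
    answer: str
    needs_review: bool = True    # flagged until a human accepts or corrects it

def caption_with_gpt4v(image_path: str) -> str:
    """Stage 1 (hypothetical stub): ask GPT-4V for a detailed image description."""
    return f"<detailed caption of {image_path}>"

def questions_with_gpt35(caption: str, domain: str, instruction_type: str) -> List[Tuple[str, str]]:
    """Stage 2 (hypothetical stub): expand the caption into (question, answer) drafts."""
    return [(f"[{instruction_type}] question about a {domain} image", "draft answer")]

def build_records(image_path: str, domain: str) -> List[InstructionRecord]:
    """Run the two automatic stages; stage 3 (manual correction) happens offline."""
    caption = caption_with_gpt4v(image_path)
    records = []
    for itype in INSTRUCTION_TYPES:
        for question, answer in questions_with_gpt35(caption, domain, itype):
            records.append(InstructionRecord(image_path, domain, itype, question, answer))
    return records

if __name__ == "__main__":
    for rec in build_records("images/astronomy_0001.jpg", "astronomy"):
        print(rec.instruction_type, "->", rec.question)
```

In this reading, only the two model-driven stages are automated; the `needs_review` flag stands in for the manual-correction step that, per the paper, keeps annotation quality high while cutting cost to about one-sixth of fully manual construction.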