19 Apr 2024 | Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang
The paper introduces a new approach, Square, for creating a massive and high-quality instruction-tuning dataset, Square-10M, using closed-source Multimodal Large Language Models (MLLMs). The dataset is constructed through four steps: Self-Questioning, Answering, Reasoning, and Evaluation. The authors then train a model called TextSquare on this dataset, which significantly outperforms both open-source and closed-source state-of-the-art models in various text-centric Visual Question Answering (VQA) benchmarks. Key findings include:
1. **Performance Improvement**: TextSquare surpasses existing open-source models and even matches or outperforms leading closed-source models such as GPT-4V and Gemini.
2. **Reasoning Data Importance**: The dataset's reasoning data significantly improves model performance and reduces hallucinations.
3. **Data Scale and Performance**: The relationship between instruction-tuning data scale and model performance is exponential, underscoring the necessity of large-scale, high-quality datasets.
The paper also discusses the limitations of the approach, such as the high computational cost of training on such a large dataset and performance that still falls short of human level. Overall, the work provides a comprehensive perspective on the role of instruction-tuning data in text-centric VQA, emphasizing the importance of both quantity and quality.
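To make the data-construction steps concrete, below is a minimal Python sketch of how the four Square stages (Self-Questioning, Answering, Reasoning, Evaluation) could be strung together for a single image. It is an illustrative assumption only: the `ask` callable, the prompt wording, and the `SquareExample` record are hypothetical stand-ins for whatever closed-source MLLM API and prompt templates the authors actually used.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SquareExample:
    """One generated and verified instruction-tuning sample (hypothetical schema)."""
    image_path: str
    question: str
    answer: str
    reasoning: str


def build_square_examples(
    image_path: str,
    ask: Callable[[str, str], str],  # ask(image_path, prompt) -> MLLM response text
    n_questions: int = 3,
) -> List[SquareExample]:
    """Generate and filter VQA pairs for one image, mirroring the four Square steps."""
    # 1. Self-Questioning: the MLLM proposes text-centric questions about the image.
    raw = ask(
        image_path,
        f"Propose {n_questions} questions about the text in this image, one per line.",
    )
    questions = [q.strip() for q in raw.splitlines() if q.strip()]

    examples: List[SquareExample] = []
    for question in questions:
        # 2. Answering: the MLLM answers its own question.
        answer = ask(image_path, f"Answer concisely: {question}")

        # 3. Reasoning: elicit the evidence behind the answer, kept as extra supervision.
        reasoning = ask(
            image_path,
            f"Question: {question}\nAnswer: {answer}\n"
            "Explain the reasoning behind this answer.",
        )

        # 4. Evaluation: keep the pair only if the MLLM judges it correct and relevant.
        verdict = ask(
            image_path,
            f"Question: {question}\nAnswer: {answer}\n"
            "Is this answer correct and relevant to the image? Reply yes or no.",
        )
        if verdict.strip().lower().startswith("yes"):
            examples.append(SquareExample(image_path, question, answer, reasoning))
    return examples
```

In practice this loop would run over a large image corpus to reach Square-10M's scale, and the paper's Evaluation stage is richer than a single yes/no check; the sketch only conveys the overall flow.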