TextSquare: Scaling up Text-Centric Visual Instruction Tuning


19 Apr 2024 | Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binhong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang
This paper introduces TextSquare, a text-centric visual instruction tuning model that significantly outperforms existing open-source models and even matches state-of-the-art closed-source models on several benchmarks. The model is trained on Square-10M, a massive, high-quality text-centric visual question answering (VQA) instruction-tuning dataset generated with closed-source multimodal large language models (MLLMs). The dataset is constructed through a four-step process: Self-Questioning, Answering, Reasoning, and Evaluation. Square-10M starts from 3.8 million text-rich images collected from diverse sources; after generation and filtering, it contains 9.1 million question-answer pairs together with their reasoning contexts.

Trained on this dataset, TextSquare achieves superior performance on multiple benchmarks, including a score of 62.2% on OCRBench, and outperforms top-tier models such as GPT-4V and Gemini on six of ten text-centric benchmarks. The paper demonstrates that reasoning data is crucial for improving model performance and reducing hallucinations in text-centric VQA tasks. It also examines the relationship between instruction-tuning data scale, convergence loss, and model performance, showing that larger, higher-quality datasets lead to better models.
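The paper does not release its generation code; the following is a minimal sketch of how a Square-style pipeline could be orchestrated. The `query_mllm` helper, the prompts, and the yes/no filtering rule are illustrative assumptions standing in for the closed-source MLLM calls described in the paper.

```python
# Hypothetical sketch of the Square data-generation pipeline
# (Self-Questioning, Answering, Reasoning, Evaluation). query_mllm()
# is a placeholder for a closed-source MLLM API call; prompts and
# the filtering rule are illustrative, not taken from the paper.

from dataclasses import dataclass


@dataclass
class VQASample:
    image_path: str
    question: str = ""
    answer: str = ""
    reasoning: str = ""
    keep: bool = False


def query_mllm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a closed-source multimodal LLM."""
    raise NotImplementedError


def square_pipeline(image_path: str) -> list[VQASample]:
    samples = []
    # 1) Self-Questioning: ask the MLLM to propose text-centric questions.
    questions = query_mllm(
        image_path,
        "Read the text in this image and propose questions about it, one per line.",
    ).splitlines()

    for q in (q.strip() for q in questions if q.strip()):
        s = VQASample(image_path=image_path, question=q)
        # 2) Answering: answer the question grounded in the image text.
        s.answer = query_mllm(image_path, f"Answer based on the image text: {q}")
        # 3) Reasoning: elicit the rationale that supports the answer.
        s.reasoning = query_mllm(
            image_path, f"Explain step by step why '{s.answer}' answers: {q}"
        )
        # 4) Evaluation: self-check consistency; keep only confirmed pairs.
        verdict = query_mllm(
            image_path,
            f"Question: {q}\nAnswer: {s.answer}\nIs this answer correct? Reply yes or no.",
        )
        s.keep = verdict.strip().lower().startswith("yes")
        samples.append(s)

    # Filtering: retain only samples that pass the evaluation step.
    return [s for s in samples if s.keep]
```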
TextSquare is built from a vision encoder adapted from OpenAI's CLIP ViT-L-14-336, an LLM based on InternLM-2, and a projector that semantically aligns vision and text tokens. The model is trained with supervised fine-tuning on Square-10M and achieves exceptional performance on text-centric VQA tasks with 8.6B parameters and an input image resolution of 700. The paper also discusses the limitations of the approach, including the high computational cost of training on such a large-scale dataset and the fact that synthetic data still cannot reach human-level quality. Overall, the study offers a data-centric perspective on the role of instruction-tuning data in text-centric VQA, confirming that both the quantity and quality of data are crucial to model performance.
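As a rough illustration of the described encoder-projector-LLM layout (hidden sizes, class names, and the two-layer MLP projector are assumptions, not the paper's exact implementation), a CLIP-style vision encoder can be bridged to the language model by projecting its patch features into the LLM's embedding space:

```python
# Schematic TextSquare-style architecture: a CLIP ViT vision encoder,
# a projector mapping vision features into the LLM embedding space,
# and a decoder-only LLM. Dimensions and class names are illustrative
# assumptions, not the paper's configuration.

import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Two-layer MLP aligning vision tokens with text token embeddings."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_tokens)


class TextCentricVLM(nn.Module):
    """Minimal wrapper: encode the image, project, prepend to text embeddings."""

    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a CLIP ViT backbone
        self.projector = projector
        self.llm = llm  # expected to accept inputs_embeds, as HF-style LLMs do

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        vision_tokens = self.vision_encoder(pixel_values)   # patch features
        vision_embeds = self.projector(vision_tokens)        # align to LLM space
        inputs_embeds = torch.cat([vision_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

In such a setup, supervised fine-tuning on Square-10M would update the projector and LLM so that answers are conditioned jointly on the projected vision tokens and the tokenized question.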