26 Jun 2024 | Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, Roy Ka-Wei Lee
Math-LLaVA is a multimodal large language model (MLLM) fine-tuned on MathV360K, a dataset of 40K high-quality images with question-answer pairs plus 320K newly synthesized pairs, built to broaden and deepen multimodal mathematical reasoning. Based on LLaVA-1.5, the model gains 19 points on MathVista's minitest split, reaching performance comparable to GPT-4V, and also improves markedly on the MMMU benchmark, indicating stronger generalization. The dataset and code are available at https://github.com/HZQ950419/Math-LLaVA.

The work addresses the scarcity of high-quality, diverse multimodal mathematical datasets by collecting and synthesizing data from 24 existing sources. Data selection filters source images for clarity and comprehension complexity; data augmentation then synthesizes additional questions for each selected image to improve robustness and reasoning. The paper also examines the effectiveness of these selection and synthesis strategies, showing that both contribute to the observed gains and underscoring the importance of dataset diversity and targeted synthesis for MLLMs' mathematical reasoning; a pipeline sketch follows below.
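The selection-then-synthesis pipeline can be pictured as follows. This is a minimal, hypothetical Python sketch, not the authors' implementation (that lives in the linked repository): the functions `score_clarity`, `score_complexity`, and `synthesize_questions` are illustrative stand-ins for model-based scoring and question generation. The default of 8 synthesized pairs per image mirrors the dataset's ratio of 40K selected pairs to 320K synthesized ones.

```python
"""Hypothetical sketch of a select-then-augment dataset pipeline.
The scoring and synthesis functions are illustrative stand-ins."""
import random
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str
    question: str
    answer: str

def score_clarity(image_path: str) -> float:
    # Stand-in: a real pipeline would rate how legible the text and
    # diagrams in the image are (e.g., with a vision model).
    return random.random()

def score_complexity(image_path: str) -> float:
    # Stand-in for rating how much reasoning the image demands.
    return random.random()

def synthesize_questions(sample: Sample, n: int):
    # Stand-in for model-based synthesis of new questions about the
    # same image; a real pipeline would prompt an LLM here.
    for i in range(n):
        yield f"[variant {i}] {sample.question}", sample.answer

def build_dataset(samples, clarity_min=0.5, complexity_min=0.5, per_sample=8):
    """Stage 1: filter images on clarity and comprehension complexity.
    Stage 2: augment each surviving pair with synthesized questions."""
    selected = [s for s in samples
                if score_clarity(s.image_path) >= clarity_min
                and score_complexity(s.image_path) >= complexity_min]
    augmented = list(selected)
    for s in selected:
        augmented.extend(Sample(s.image_path, q, a)
                         for q, a in synthesize_questions(s, per_sample))
    return augmented

if __name__ == "__main__":
    seed = [Sample("geo_001.png", "What is the area of the triangle?", "24")]
    # Thresholds of 0.0 keep everything; expect 1 original + 8 variants = 9.
    print(len(build_dataset(seed, clarity_min=0.0, complexity_min=0.0)))
```

The two stages are deliberately decoupled: filtering first keeps the synthesis budget focused on images that are both readable and reasoning-rich, which is the property the summary above attributes to the 40K selected seed pairs.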