16 Dec 2024 | Haoxiong Liu, Yifan Zhang, Yifan Luo, Andrew Chi-Chih Yao
This paper addresses the challenge of enhancing large language models (LLMs) for solving complex mathematical problems, particularly in the context of open-source LLMs without external tools. The authors introduce the MMIQC dataset, which combines processed web data and synthetic question-response pairs to improve the mathematical reasoning capabilities of base language models. The dataset includes around 1200k question-response pairs from math.stackexchange.com and synthetic data generated using various augmentation methods, including iterative question composing (IQC), answer augmentation, and rejection sampling.
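To make the data-construction steps more concrete, here is a minimal Python sketch of rejection-sampling-style answer augmentation of the kind described above. The `generate` and `extract_answer` callables are hypothetical placeholders for an LLM call and an answer parser, not functions from the paper's released code, and the prompt and sample count are illustrative only.

```python
from typing import Callable, Dict, List

def augment_answers(
    question: str,
    reference_answer: str,
    generate: Callable[[str], str],        # assumed LLM call: prompt -> one sampled solution
    extract_answer: Callable[[str], str],  # assumed parser for a solution's final answer
    num_samples: int = 8,
) -> List[Dict[str, str]]:
    """Sample several step-by-step solutions and keep only those whose
    final answer matches the reference answer (rejection sampling)."""
    kept: List[Dict[str, str]] = []
    for _ in range(num_samples):
        solution = generate(f"Solve the problem step by step:\n{question}")
        if extract_answer(solution) == reference_answer:
            kept.append({"question": question, "response": solution})
    return kept
```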
The key contribution of the paper is the proposed IQC method, which iteratively generates new questions from seed problems using an LLM and applies rejection sampling to filter out incorrect answers. This method substantially increases the diversity and quality of the dataset, leading to improved performance on the MATH benchmark. Notably, the model Qwen-72B-MMIQC achieves 45.0% accuracy, exceeding the previous open-source state-of-the-art by 8.2% and outperforming the initial version of GPT-4 released in 2023.
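The sketch below shows one way such an IQC loop could be organized, under stated assumptions: the `compose`, `sample_solutions`, and `extract_answer` callables are hypothetical stand-ins for the paper's actual prompts and pipeline, and the iteration count and filtering criteria are illustrative rather than the authors' settings.

```python
from typing import Callable, Dict, List, Tuple

def iterative_question_composing(
    seed_problems: List[Tuple[str, str]],          # (question, reference_answer) pairs
    compose: Callable[[str], Tuple[str, str]],     # assumed LLM call: seed question -> (new question, new answer)
    sample_solutions: Callable[[str], List[str]],  # assumed LLM call: question -> candidate solutions
    extract_answer: Callable[[str], str],          # assumed parser for a solution's final answer
    num_iterations: int = 3,
) -> List[Dict[str, str]]:
    """Compose new questions from seed problems, verify them by rejection
    sampling, and feed verified questions back in as seeds for the next round."""
    dataset: List[Dict[str, str]] = []
    current = list(seed_problems)
    for _ in range(num_iterations):
        next_round: List[Tuple[str, str]] = []
        for question, _answer in current:
            new_question, new_answer = compose(question)
            # Rejection sampling: keep only solutions that reproduce the composed answer.
            verified = [
                s for s in sample_solutions(new_question)
                if extract_answer(s) == new_answer
            ]
            if verified:
                dataset.extend(
                    {"question": new_question, "response": s} for s in verified
                )
                next_round.append((new_question, new_answer))
        current = next_round  # composed questions seed the next iteration
    return dataset
```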
The authors also evaluate the models on the 2023 Hungarian national high school finals in mathematics, demonstrating that the improved models can generalize to unseen data. The paper includes a detailed evaluation of different subsets of MMIQC and an ablation study to understand the contribution of each component to the overall performance. The results highlight the effectiveness of the proposed methods in enhancing the mathematical reasoning abilities of LLMs.