16 Dec 2024 | Haoxiong Liu, Yifan Zhang, Yifan Luo, Andrew Chi-Chih Yao
This paper introduces the MMIQC dataset and the Iterative Question Composing (IQC) method to enhance the mathematical reasoning capabilities of base language models. MMIQC is a mixture of processed web data and synthetic question-response pairs, designed to improve the performance of large language models (LLMs) on math problem solving. The IQC method iteratively generates new questions from seed problems with an LLM and applies rejection sampling to filter out responses with incorrect answers. The resulting dataset contains roughly 1.2 million question-response pairs, with additional data produced by several other augmentation methods.

Models fine-tuned on MMIQC consistently outperform their counterparts on the MATH benchmark: Qwen-72B-MMIQC reaches 45.0% accuracy, surpassing the previous open-source state of the art by 8.2%. The mathematical reasoning abilities gained from fine-tuning on MMIQC also generalize to unseen data, such as the 2023 Hungarian National High School Mathematics Finals. The study highlights the effectiveness of combining continual pre-training with supervised fine-tuning, as well as the efficiency of using multiple augmentation methods to construct fine-tuning datasets. Overall, the results show that IQC is a powerful data augmentation method that can iteratively generate diverse data starting from a seed set of math word problems, leading to significant improvements in model performance.
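The compose-then-filter loop at the heart of IQC can be sketched as below. This is a minimal illustration, not the paper's implementation: `compose_question`, `sample_answers`, and `check_answer` are hypothetical stand-ins for calls to a composer LLM, an answering LLM, and an answer checker, respectively.

```python
# Sketch of Iterative Question Composing: each round, an LLM composes new
# questions from the current seeds, candidate answers are sampled, and
# rejection sampling keeps only question-answer pairs whose answer matches
# a reference. All functions below are deterministic stubs for illustration.

def compose_question(seed_question: str, iteration: int) -> str:
    """Stand-in for the composer LLM writing a new question from a seed."""
    return f"[iter {iteration}] variant of: {seed_question}"

def sample_answers(question: str, k: int) -> list[str]:
    """Stand-in for sampling k candidate solutions from the answering LLM."""
    # Stub: half the samples are "correct" purely for demonstration.
    return ["42" if i % 2 == 0 else "7" for i in range(k)]

def check_answer(candidate: str, reference: str) -> bool:
    """Rejection-sampling filter: keep answers matching the reference."""
    return candidate.strip() == reference.strip()

def iterative_question_composing(seeds, reference_answer, rounds=3, k=4):
    """Grow a synthetic Q-A dataset by iterating compose -> sample -> filter."""
    dataset = []
    frontier = list(seeds)  # questions that seed the next round
    for it in range(1, rounds + 1):
        next_frontier = []
        for q in frontier:
            new_q = compose_question(q, it)
            kept = [a for a in sample_answers(new_q, k)
                    if check_answer(a, reference_answer)]
            if kept:  # keep the question only if a verified answer survives
                dataset.append((new_q, kept[0]))
                next_frontier.append(new_q)
        frontier = next_frontier
    return dataset
```

Because each round's accepted questions seed the next, diversity compounds across iterations, which is why the method can expand a small seed set of math word problems into a large fine-tuning corpus.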