Understanding Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning

This paper introduces Key-Point-Driven Data Synthesis (KPDDS), a data synthesis framework that generates high-quality, reasoning-focused question-answer pairs from key points and exemplar practices drawn from authentic data sources. KPDDS is designed to produce novel questions with rigorous quality control and substantial scalability. The framework is used to create KPMath, a synthetic dataset for mathematical reasoning containing over 800,000 question-answer pairs; augmenting KPMath with additional reasoning-intensive corpora yields the more comprehensive KPMath-Plus dataset. The Qwen1.5-72B model fine-tuned on KPMath-Plus achieves 87.0% PASS@1 accuracy on GSM8K and 58.3% on MATH, surpassing competitors in the 7B to 70B range and the best commercial models such as GPT-4 across multiple math reasoning datasets.
The KPDDS framework consists of two main phases: Knowledge Construction and Practice Generation. Knowledge Construction extracts topics and key points from seed problems using a labeling model, then clusters them for deduplication and alignment. This yields the Math Practices with Key Points (MPKP) dataset and the Topic-level Co-occurrence Probability Matrix (TCPM), which captures the frequency and distribution of topic pairs within the dataset. Practice Generation samples multiple topics and key points from MPKP, guided by the TCPM. These key points, together with corresponding example practices, are given to a synthesizing model that generates new questions. A scoring model then assesses question quality, and only high-scoring questions proceed. A reasoning model generates a range of candidate answers for each retained question, which are consolidated into a consensus solution through a voting mechanism. The training sets of the MATH and GSM8K datasets serve as the seed data for KPMath.
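As a rough illustration of the TCPM-guided sampling and the answer-voting step described above, the following Python sketch may help. It is not the authors' implementation: the function names (`build_tcpm`, `sample_topic_pair`, `consensus_answer`), the toy topic sets, the normalization of pair counts into probabilities, and the `min_votes` threshold are all assumptions made for this example.

```python
import random
from collections import Counter
from itertools import combinations


def build_tcpm(problem_topics):
    """Estimate co-occurrence probabilities for unordered topic pairs.

    problem_topics: list of topic sets, one per seed problem (toy input).
    Returns a dict mapping a sorted (topic_a, topic_b) tuple to its share
    of all counted pair occurrences. This normalization is an assumption,
    not necessarily the paper's definition of the TCPM.
    """
    counts = Counter()
    for topics in problem_topics:
        for pair in combinations(sorted(set(topics)), 2):
            counts[pair] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


def sample_topic_pair(tcpm, rng):
    """Draw one topic pair with probability proportional to its TCPM entry."""
    pairs = list(tcpm)
    weights = [tcpm[p] for p in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]


def consensus_answer(candidates, min_votes=2):
    """Majority vote over candidate final answers.

    Returns the most frequent answer, or None when no answer reaches
    min_votes (a hypothetical threshold for discarding unreliable items).
    """
    if not candidates:
        return None
    answer, votes = Counter(candidates).most_common(1)[0]
    return answer if votes >= min_votes else None


# Toy seed data: each set holds the topics labeled for one seed problem.
seed_topics = [
    {"algebra", "geometry"},
    {"algebra", "probability"},
    {"algebra", "geometry"},
    {"number theory"},
]
tcpm = build_tcpm(seed_topics)
pair = sample_topic_pair(tcpm, random.Random(0))
answer = consensus_answer(["42", "42", "41"])
```

In this sketch, frequently co-occurring topics are sampled together more often, and a question whose candidate answers never agree is dropped by returning `None`. Sampling key points within the chosen topics and scoring the synthesized questions would be separate model-driven steps that are not shown here.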
The training corpus is further enriched by integrating a series of mathematical reasoning datasets, leading to the creation of the comprehensive training dataset, KPMath-Plus. By fine-tuning the Qwen1.5-72B model on KPMath-Plus, the model achieves zero-shot PASS@1 accuracies of 87.0% on the GSM8K test set and 58.3% on the MATH test set, with an average of 81.5% across six math reasoning datasets. This performance exceeds that of all competitors within the 7B to 70B model size range and of the best commercial models such as GPT-4. In the Hungarian Exam Score test, the KPMath-Plus-Mistral-7B model also outperforms the majority of models, indicating its competitive performance. The paper also discusses related work in math reasoning with LLMs and data synthesis for math reasoning, highlighting the importance of answer quality, question novelty, and synthetic data quality.

Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning

8 May 2024 | Yiming Huang, Xiao Liu, Yeyun Gong, Zhibin Gou, Yelong Shen, Nan Duan, Weizhu Chen