Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning

8 May 2024 | Yiming Huang, Xiao Liu, Yeyun Gong, Zhibin Gou, Yelong Shen, Nan Duan, Weizhu Chen
The paper introduces Key-Point-Driven Data Synthesis (KPDDS), a framework for generating high-quality, novel question-answer pairs for mathematical reasoning. KPDDS draws on key points and exemplar practices from authentic data sources so that the generated questions are rigorous and diverse. Using it, the authors build KPMath, a synthetic dataset for mathematical reasoning comprising over 800K question-answer pairs, and then combine it with additional reasoning-intensive corpora to create the more comprehensive KPMath-Plus dataset. Fine-tuning the Qwen1.5-7B model on KPMath-Plus achieves 87.0% PASS@1 accuracy on GSM8K and 58.3% on MATH, surpassing competitors in the 7B to 70B model size range and outperforming commercial models like GPT-4 across multiple math reasoning datasets. The paper also covers related work, methodology, experimental details, and evaluation results, which together highlight the effectiveness of KPDDS in enhancing the mathematical reasoning capabilities of large language models.