MathScale: Scaling Instruction Tuning for Mathematical Reasoning


5 Mar 2024 | Zhengyang Tang, Xingxing Zhang, Benyou Wang, Furu Wei
MathScale is a method for generating high-quality mathematical reasoning data with frontier large language models (LLMs) such as GPT-3.5. Inspired by cognitive mechanisms in human mathematical learning, it first extracts topics and knowledge points from seed math questions, then builds a concept graph that connects concepts co-occurring in the same question, and finally samples from that graph to generate new math questions and answers (sketched in code below). This pipeline scales to large volumes of data, yielding MathScaleQA, a dataset of two million math question-answer pairs.

To evaluate mathematical reasoning comprehensively, the authors also introduce MWPBENCH, a benchmark of ten datasets spanning K-12, college, and competition-level math problems. Fine-tuning open-source LLMs such as LLaMA-2 and Mistral on MathScaleQA substantially improves their mathematical reasoning: on MWPBENCH, MathScale-7B achieves state-of-the-art performance, outperforming peers of equivalent size by 42.9% in micro average accuracy and 43.7% in macro average accuracy. Performance also scales, growing nearly log-linearly as the size of the training set increases.

Ablation studies show that the numbers of seed questions and of extracted concepts significantly affect performance, with more diverse and comprehensive seed data leading to better results. An answer-validation step was considered but ultimately omitted because it brought limited gains. MathScale also performs well on FreshGaokaoMath-2023, a dataset of recent exam questions, demonstrating robustness on problems unseen during data generation. By enabling LLMs to emulate human cognitive processes in mathematical learning, the method addresses limitations of earlier instruction-tuning approaches.
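The generation pipeline can be summarized in a few lines of code. The snippet below is a minimal sketch, not the authors' implementation: the `llm` helper is a hypothetical stand-in for a GPT-3.5-style chat-completion call, and the prompts, parsing, and random-walk sampling are simplified assumptions.

```python
import random
from collections import defaultdict
from itertools import combinations

def llm(prompt: str) -> str:
    """Hypothetical placeholder for a frontier-LLM call (e.g., GPT-3.5)."""
    raise NotImplementedError("plug in a chat-completion client here")

def extract_concepts(question: str) -> list[str]:
    """Ask the LLM for the topics/knowledge points behind a seed question."""
    reply = llm("List the math topics and knowledge points needed to solve "
                "this problem, one per line:\n" + question)
    return [line.strip() for line in reply.splitlines() if line.strip()]

def build_concept_graph(seed_questions: list[str]) -> dict[str, set[str]]:
    """Connect two concepts whenever they co-occur in the same seed question."""
    graph: dict[str, set[str]] = defaultdict(set)
    for q in seed_questions:
        for a, b in combinations(extract_concepts(q), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def sample_concepts(graph: dict[str, set[str]], k: int = 3) -> list[str]:
    """Random walk over the concept graph to pick a small, coherent subset."""
    node = random.choice(list(graph))
    picked = {node}
    for _ in range(10 * k):  # cap the walk so it always terminates
        if len(picked) >= k or not graph[node]:
            break
        node = random.choice(sorted(graph[node]))
        picked.add(node)
    return sorted(picked)

def generate_qa(graph: dict[str, set[str]]) -> tuple[str, str]:
    """Compose a new question-answer pair from a sampled concept subset."""
    concepts = sample_concepts(graph)
    question = llm("Write a new math word problem that exercises these "
                   "concepts: " + ", ".join(concepts))
    answer = llm("Solve step by step:\n" + question)
    return question, answer
```

Repeating `generate_qa` over many sampled concept subsets is what lets the dataset grow far beyond the original seed questions.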
Overall, MathScale provides a scalable and effective way to generate high-quality mathematical reasoning data, enhancing the evaluation and training of LLMs in mathematical tasks.
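For concreteness, the two headline metrics differ only in how the ten MWPBENCH datasets are pooled: micro averaging counts every question equally (large datasets dominate), while macro averaging counts every dataset equally. A minimal sketch, assuming per-dataset (correct, total) counts; the dataset names and numbers below are invented for illustration:

```python
def micro_macro_accuracy(results: dict[str, tuple[int, int]]) -> tuple[float, float]:
    """results maps dataset name -> (num_correct, num_total)."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    micro = correct / total                                   # pooled questions
    macro = sum(c / t for c, t in results.values()) / len(results)  # mean per dataset
    return micro, macro

# Toy illustration with made-up counts for two datasets:
print(micro_macro_accuracy({"GSM8K": (800, 1319), "MATH": (1700, 5000)}))
```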