7 Mar 2024 | Chen Li, Weiqi Wang, Jingcheng Hu, Yixuan Wei, Nanning Zheng, Han Hu, Zheng Zhang, Houwen Peng
This paper demonstrates that common 7B language models, such as LLaMA-2 7B, already possess strong mathematical capabilities without extensive math-related pre-training. The model achieves impressive accuracies of 97.7% and 72.0% on the GSM8K and MATH benchmarks, respectively, when the best response is selected from 256 random generations. However, the primary issue is instability in consistently generating correct answers: accuracy drops to 49.5% and 7.9% on the same benchmarks when only one random generation per question is considered. To address this, the authors scale up the supervised fine-tuning (SFT) data using synthetic math questions generated by GPT-4 Turbo. This approach significantly improves the reliability of generating correct answers, achieving an accuracy of 82.6% on GSM8K and 40.6% on MATH and surpassing previous models by 14.2% and 20.8%, respectively. The study also provides insights into scaling behaviors across different reasoning complexities and error types, showing that calculation errors are more easily mitigated than reasoning errors. The results highlight the effectiveness of synthetic SFT data in enhancing the mathematical capabilities of common language models.
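The 97.7%/72.0% figures are best-of-256 numbers: a question counts as solved if any of the 256 sampled generations reaches the correct answer, whereas the 49.5%/7.9% figures reflect a single sample per question. Below is a minimal sketch of how such best-of-N accuracy can be estimated from sampled generations, using the standard unbiased pass@k estimator; the function names and the toy counts are illustrative assumptions, not details taken from the paper.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single question: the probability
    that at least one of k samples drawn without replacement from n total
    generations, c of which are correct, is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist, so some draw must be correct
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a benchmark, given the number of correct
    generations (out of n) observed for each question."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Toy example: 4 questions, 256 generations each (hypothetical correct counts).
correct_counts = [130, 3, 0, 255]
print(benchmark_pass_at_k(correct_counts, n=256, k=1))    # ~ single-sample accuracy
print(benchmark_pass_at_k(correct_counts, n=256, k=256))  # ~ best-of-256 accuracy
```

As k grows from 1 to 256, the estimate shifts from single-sample reliability toward the model's coverage of solvable questions, which is exactly the gap the synthetic SFT data is intended to close.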