7 Mar 2024 | Chen Li, Weiqi Wang, Jingcheng Hu, Yixuan Wei, Nanning Zheng, Han Hu, Zheng Zhang, Houwen Peng
Common 7B language models such as LLaMA-2 7B already possess strong mathematical capabilities, as shown by their performance on the GSM8K and MATH benchmarks. When the best response is selected from 256 random generations, the LLaMA-2 7B base model reaches 97.7% accuracy on GSM8K and 72.0% on MATH. However, accuracy drops sharply when only a single generation is considered, which points to an instability issue: the model can produce correct solutions, but does so unreliably.

Scaling up supervised fine-tuning (SFT) data substantially improves this reliability, and synthetically generated data proves nearly as effective as real data. Scaling synthetic SFT data to 960K examples for GSM8K and 480K for MATH significantly enhances accuracy, surpassing previous models, and the model's performance on the GSM8K and MATH benchmarks is comparable to or exceeds that of GPT-4. The study thus shows that scaling SFT data improves both the stability and the accuracy of mathematical problem solving, with synthetic data serving as a viable and effective alternative to real data. Overall, the results demonstrate that common language models can achieve strong mathematical capabilities without math-specific pre-training, and that scaling SFT data with synthetic questions is an effective way to unlock them.
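To make the best-of-256 metric concrete, below is a minimal sketch of how such an evaluation could be computed. It is not the paper's code: the helpers `generate` (samples one answer for a question at non-zero temperature) and `is_correct` (compares an extracted answer against the reference) are hypothetical placeholders for whatever model-sampling and answer-checking routines are actually used.

```python
from typing import Callable, Dict, List


def best_of_n_accuracy(
    problems: List[Dict[str, str]],                # each item: {"question": ..., "answer": ...}
    generate: Callable[[str], str],                # hypothetical: sample one answer string for a question
    is_correct: Callable[[str, str], bool],        # hypothetical: check a generated answer against the reference
    n: int = 256,
) -> float:
    """Fraction of problems solved by at least one of n sampled generations (oracle best-of-n)."""
    solved = 0
    for p in problems:
        samples = (generate(p["question"]) for _ in range(n))
        if any(is_correct(ans, p["answer"]) for ans in samples):
            solved += 1
    return solved / len(problems)


def single_generation_accuracy(
    problems: List[Dict[str, str]],
    generate: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> float:
    """Accuracy when only one generation per problem is considered."""
    return best_of_n_accuracy(problems, generate, is_correct, n=1)
```

The best-of-n number with an oracle checker measures whether a correct solution appears anywhere in the model's sample distribution; the gap between it and the single-generation accuracy is the instability the summary refers to, and it is this gap that scaling SFT data is reported to close.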