DeepSeekMath 7B is a domain-specific language model designed to enhance mathematical reasoning capabilities. The model is pre-trained on a large corpus of 120B math-related tokens sourced from Common Crawl, combined with natural language and code data. It achieves impressive performance on the MATH benchmark, scoring 51.7% without relying on external toolkits or voting techniques, approaching the level of Gemini-Ultra and GPT-4. The model's effectiveness is attributed to two key factors: the use of publicly available web data through a meticulously engineered data selection pipeline, and the introduction of Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO) that enhances mathematical reasoning while optimizing memory usage.
The paper also explores the impact of code training on mathematical reasoning, finding that code training improves both program-aided and non-program-aided mathematical problem-solving abilities. GRPO reduces training resources by estimating the baseline from group scores instead of using a separate critic model. Applied on top of the instruction-tuned DeepSeekMath-Instruct, GRPO significantly improves performance, reaching over 50% accuracy on the competition-level MATH dataset.
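To make the baseline idea concrete, the following is a minimal Python sketch (not taken from the paper) of how group-relative advantages can be computed: several answers are sampled for the same question, and each answer's advantage is its reward relative to the group mean, scaled by the group standard deviation, so no learned critic network is required. The function name, variable names, and the epsilon term are illustrative assumptions.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Compute GRPO-style advantages for one group of sampled outputs.

    The group mean reward serves as the baseline in place of a critic
    model; rewards are normalized by the group's standard deviation.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()        # group mean replaces the critic's value estimate
    scale = rewards.std() + 1e-8     # small epsilon avoids division by zero when rewards are identical
    return (rewards - baseline) / scale

# Example: four sampled answers to one math question, scored by a reward model
print(group_relative_advantages([0.1, 0.9, 0.4, 0.6]))
```

In this sketch, answers scoring above the group average receive positive advantages and are reinforced, while below-average answers are penalized, which is the sense in which the group statistics stand in for a critic.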
The evaluation covers English and Chinese benchmarks, demonstrating strong performance in both domains. The model also shows improvements in formal mathematics and general reasoning tasks, such as MMLU and BBH. The paper provides a unified paradigm to understand different reinforcement learning methods and discusses the effectiveness of reinforcement learning in boosting the performance of instruction-tuned models.