DeepSeekMath is a domain-specific language model that significantly outperforms open-source models in mathematical reasoning and approaches the performance level of GPT-4 on academic benchmarks. The model is trained on a large-scale, high-quality pre-training corpus of 120B math-related tokens sourced from Common Crawl, with a meticulously designed data selection pipeline. This corpus is filtered for mathematical content and includes both English and Chinese data, enhancing performance on Chinese mathematical benchmarks.
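To make the data selection concrete, below is a minimal sketch of the kind of classifier-based filtering pass such a pipeline relies on; the paper describes an iterative fastText-based classifier seeded with known math pages, but the model path, label name, and threshold here are illustrative assumptions rather than the actual pipeline.

```python
import fasttext  # pip install fasttext

# Illustrative assumptions: the classifier file, label name, and threshold are
# placeholders. In practice the classifier would be trained on seed math pages
# (positives) versus random Common Crawl pages (negatives) and refined iteratively.
MODEL_PATH = "math_page_classifier.bin"
MATH_LABEL = "__label__math"
THRESHOLD = 0.5

model = fasttext.load_model(MODEL_PATH)

def keep_page(page_text: str) -> bool:
    """Return True if the page looks math-related.

    fastText's predict() expects a single line of text, so newlines are
    collapsed before scoring.
    """
    labels, probs = model.predict(page_text.replace("\n", " "), k=1)
    return labels[0] == MATH_LABEL and probs[0] >= THRESHOLD
```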
DeepSeekMath-Base 7B is initialized from DeepSeek-Coder-Base-v1.5 7B, since starting math pre-training from a code-trained model turns out to be a better choice than starting from a general LLM. The base model achieves strong performance on benchmarks such as GSM8K and MATH, outperforming Minerva 540B. It is further fine-tuned on chain-of-thought, program-of-thought, and tool-integrated reasoning data, resulting in DeepSeekMath-Instruct 7B, which surpasses all 7B counterparts and is comparable to 70B open-source instruction-tuned models.
Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), is introduced to enhance mathematical reasoning while reducing the memory and compute needed for RL training. GRPO estimates the baseline from the scores of a group of sampled outputs, eliminating the need for a separate critic (value) model. It significantly improves performance on in-domain and out-of-domain mathematical tasks, and the resulting DeepSeekMath-RL 7B achieves 51.7% on the competition-level MATH benchmark without external tools.
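The core of GRPO's baseline trick can be shown in a few lines: for each question, a group of outputs is sampled and scored, and the group's own mean and standard deviation turn the raw rewards into advantages. The sketch below covers only this normalization step (not the clipped surrogate objective or the KL regularization); the function name and the small epsilon for numerical stability are our own choices.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Turn raw rewards for G outputs sampled from the same question into
    advantages, using the group mean as the baseline in place of a learned critic."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards against a zero-variance group

# Example: four sampled solutions to one question, scored by a reward model.
print(group_relative_advantages([0.2, 0.9, 0.4, 0.7]))
# -> approximately [-1.30, 1.30, -0.56, 0.56]; above-average solutions get positive advantages.
```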
The study also explores the effectiveness of reinforcement learning (RL) in improving instruction-tuned models: GRPO delivers strong gains on both in-domain and out-of-domain tasks, reaching 88.2% on GSM8K and 51.7% on MATH. The paper further presents a unified paradigm for understanding methods such as Rejection Sampling Fine-Tuning (RFT), Direct Preference Optimization (DPO), PPO, and GRPO, showing that they can all be conceptualized as either direct or simplified RL techniques.
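The unified view is easiest to see as a single gradient expression: each method updates the policy with a log-likelihood gradient weighted by a method-specific coefficient, and the methods differ only in the data source, the reward function, and that coefficient. The formula below paraphrases this form; the notation is a reconstruction, not a verbatim quote from the paper.

```latex
\nabla_\theta \mathcal{J}_{\mathcal{A}}(\theta)
  = \mathbb{E}_{(q,\,o)\sim\mathcal{D}}\!\left[
      \frac{1}{|o|}\sum_{t=1}^{|o|}
      GC_{\mathcal{A}}(q, o, t, \pi_{\mathrm{rf}})\,
      \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t})
    \right]
```

Here $\mathcal{D}$ is the data source (e.g., supervised data or sampled outputs), $\pi_{\mathrm{rf}}$ the reward function, and $GC_{\mathcal{A}}$ the gradient coefficient; RFT, DPO, PPO, and GRPO differ only in how these three components are instantiated.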
The research highlights the importance of code training in enhancing mathematical reasoning, both with and without tool use. In contrast, training on arXiv papers brings no significant improvement in mathematical reasoning across the model sizes tested. The study discusses the limitations of these findings and suggests further research on the impact of arXiv tokens on specific math-related tasks and at larger model scales. Overall, the work provides valuable insights into the effectiveness of pre-training, reinforcement learning, and careful data selection for improving mathematical reasoning in language models.