Step-DPO is a method for enhancing the long-chain reasoning ability of large language models (LLMs) by focusing on individual reasoning steps rather than evaluating answers as a whole. It addresses a limitation of Direct Preference Optimization (DPO), which struggles to pinpoint detailed errors in long-chain mathematical reasoning because it lacks fine-grained process supervision. Step-DPO instead treats each reasoning step as the unit of preference optimization, enabling more precise error detection and correction.

To support this, a data construction pipeline was developed to generate a high-quality dataset of 10,000 step-wise preference pairs. The dataset was created in three stages: error collection, step localization, and rectification.

The results show that Step-DPO significantly improves mathematical reasoning performance: Qwen2-72B-Instruct, for example, reaches 70.8% accuracy on the MATH test set and 94.0% on the GSM8K test set, surpassing several closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro. The method is also data-efficient: only 10,000 preference pairs and fewer than 500 training steps are needed for models with over 70B parameters to achieve a nearly 3% accuracy gain on MATH. Step-DPO generalizes well beyond these benchmarks, performing strongly on competition-level problems such as AIME 2024 and Odyssey-MATH. Being simple, effective, and data-efficient, it is a valuable approach for improving long-chain reasoning in LLMs.
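To make the step-level objective concrete, the sketch below shows one way such a loss can be computed. It is a minimal illustration, not the authors' reference implementation: the function name, tensor names, and the beta value are assumptions, and the caller is assumed to have already summed token log-probabilities for each candidate step under the policy and a frozen reference model, conditioned on the problem plus the preceding verified steps.

```python
# Minimal sketch of a step-level DPO loss (illustrative names and defaults).
import torch
import torch.nn.functional as F

def step_dpo_loss(
    policy_logp_win: torch.Tensor,   # log pi_theta(corrected step | prompt, prior steps)
    policy_logp_lose: torch.Tensor,  # log pi_theta(erroneous step | prompt, prior steps)
    ref_logp_win: torch.Tensor,      # same quantities under the frozen reference model
    ref_logp_lose: torch.Tensor,
    beta: float = 0.1,               # assumed scaling on the implicit reward
) -> torch.Tensor:
    """Step-wise preference objective: the pair is a single corrected reasoning
    step (chosen) versus the erroneous step (rejected), both conditioned on the
    same verified prefix, rather than two complete solutions."""
    # Implicit rewards are the policy/reference log-ratios for each candidate step.
    reward_win = beta * (policy_logp_win - ref_logp_win)
    reward_lose = beta * (policy_logp_lose - ref_logp_lose)
    # Bradley-Terry style objective: push the margin between the corrected
    # step and the erroneous step to be positive.
    return -F.logsigmoid(reward_win - reward_lose).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
if __name__ == "__main__":
    lp_w, lp_l = torch.randn(4), torch.randn(4)
    rp_w, rp_l = torch.randn(4), torch.randn(4)
    print(step_dpo_loss(lp_w, lp_l, rp_w, rp_l).item())
```

The key difference from vanilla DPO in this sketch is what the log-probabilities range over: a single localized step given a shared, verified prefix, rather than an entire solution, which is what gives the method its finer-grained error signal.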