The paper "Step-DPO: Step-wise Preference Optimization for Long-Chain Reasoning of LLMs" addresses the challenge of enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) by focusing on long-chain reasoning tasks. The authors propose a method called Step-DPO, which treats individual reasoning steps as units for preference optimization, rather than evaluating answers holistically. This approach allows models to identify and correct detailed errors in incorrect answers more effectively.
Key contributions of the paper include:
1. **Step-DPO**: A novel method that performs preference optimization at the level of individual reasoning steps, enabling models to pinpoint and rectify errors in specific intermediate steps rather than in whole answers.
2. **Data Construction Pipeline**: A three-step process for creating a high-quality dataset of 10K step-wise preference pairs, which is crucial for training with Step-DPO (see the sketch after this list).
3. **Experimental Results**: The paper demonstrates that Step-DPO significantly improves the accuracy of LLMs on mathematical problems, achieving a gain of nearly 3% in accuracy on the MATH dataset for models with over 70B parameters. Notably, Qwen2-72B-Instruct fine-tuned with Step-DPO achieves 70.8% and 94.0% on the MATH and GSM8K test sets, respectively, surpassing several closed-source models.
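The summary does not spell out the three steps of the pipeline, so the sketch below is only a plausible shape for it, assuming the stages are error collection, localization of the first erroneous step, and rectification by resampling from the verified prefix. Every name and helper callable here (`sample_steps`, `step_is_correct`, `final_answer`, `StepPreferencePair`) is a hypothetical placeholder, not the authors' code.

```python
"""Hypothetical sketch of a three-step construction pipeline for step-wise
preference pairs: (1) collect model answers whose final result is wrong,
(2) locate the first erroneous step, (3) rectify by resampling continuations
from the verified-correct prefix."""

from dataclasses import dataclass
from typing import Callable

@dataclass
class StepPreferencePair:
    prompt: str              # the math problem
    prefix: list[str]        # verified-correct steps s_1 .. s_{k-1}
    chosen_step: str         # a corrected step s_k that leads to the right answer
    rejected_step: str       # the first erroneous step s_k in the sampled answer

def build_pairs(
    problems: list[tuple[str, str]],                      # (question, gold answer)
    sample_steps: Callable[[str, list[str]], list[str]],  # model: (question, prefix) -> remaining steps
    final_answer: Callable[[list[str]], str],             # extract final answer from a full chain
    step_is_correct: Callable[[str, list[str], str], bool],  # verifier for one step given its prefix
    n_rectify: int = 8,
) -> list[StepPreferencePair]:
    pairs: list[StepPreferencePair] = []
    for question, gold in problems:
        # 1) Error collection: keep only sampled answers with a wrong final result.
        steps = sample_steps(question, [])
        if not steps or final_answer(steps) == gold:
            continue

        # 2) Step localization: index of the first step the verifier rejects.
        k = next((i for i, s in enumerate(steps)
                  if not step_is_correct(question, steps[:i], s)), None)
        if k is None:
            continue
        prefix = steps[:k]

        # 3) Rectification: resample continuations from the correct prefix and
        #    take the first step of a continuation that reaches the gold answer.
        for _ in range(n_rectify):
            cont = sample_steps(question, prefix)
            if cont and final_answer(prefix + cont) == gold:
                pairs.append(StepPreferencePair(question, prefix, cont[0], steps[k]))
                break
    return pairs
```

In practice the sampler and verifier would be backed by an LLM; the point of the sketch is only that each preference pair shares a problem and a correct prefix and differs in exactly one step, which is what makes step-level credit assignment possible.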
The authors also highlight that in-distribution data, generated by the model itself, is more effective than out-of-distribution data for Step-DPO, as it aligns better with the model's internal representations. The paper concludes by emphasizing the potential of Step-DPO in improving the robustness and factuality of LLMs in long-chain reasoning tasks.