Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning

15 Jul 2024 | Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li
Step-Controlled DPO (SCDPO) is a method that enhances mathematical reasoning in large language models (LLMs) by introducing stepwise error supervision. The approach generates dispreferred samples that begin making errors at a specified step, so the model learns from these controlled errors during preference training. SCDPO is applied to both code-integrated and chain-of-thought solutions, improving performance on mathematical problem-solving tasks.

The method is evaluated on three SFT models, one existing model and two newly fine-tuned models, and shows consistent improvements on all of them. Applied to an InternLM2-20B model, it achieves 88.5% on GSM8K and 58.1% on MATH, rivaling other open-source models. Because the dispreferred samples encode where the first error occurs, SCDPO provides detailed stepwise supervision that helps the model locate and correct reasoning errors, and qualitative analysis shows that it effectively identifies errors in mathematical solutions. The method is implemented as a pipeline of step-controlled data generation followed by step-aware DPO training, and it is supported by extensive experiments and evaluations. The work addresses limitations of previous methods, such as the need for human annotation and the lack of detailed stepwise supervision. The results demonstrate the effectiveness and potential of SCDPO for enhancing mathematical reasoning in LLMs.
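To make the data-generation step concrete, below is a minimal sketch of how dispreferred samples with a controlled error step could be produced. This is not the authors' released code: the `generate` and `extract_answer` helpers are assumed stand-ins for an LLM sampling call and an answer parser. The idea is to keep the first k steps of a correct solution, re-sample the continuation at a higher temperature, and keep only completions whose final answer is wrong, so the error is known to start at or after step k.

```python
import random

def split_into_steps(solution: str) -> list[str]:
    """Treat each non-empty line of the solution as one reasoning step."""
    return [line for line in solution.split("\n") if line.strip()]

def make_dispreferred(problem: str, correct_solution: str, gold_answer: str,
                      generate, extract_answer,
                      num_tries: int = 8, temperature: float = 1.1):
    """Return (error_step, dispreferred_solution), or None if no wrong sample is found.

    `generate(prompt, temperature)` and `extract_answer(text)` are hypothetical
    helpers standing in for model sampling and final-answer parsing.
    """
    steps = split_into_steps(correct_solution)
    if not steps:
        return None
    k = random.randrange(len(steps))          # step index where errors may begin
    prefix = "\n".join(steps[:k])             # correct steps 0..k-1 are kept verbatim
    prompt = f"{problem}\n{prefix}"
    for _ in range(num_tries):
        continuation = generate(prompt, temperature=temperature)
        candidate = (prefix + "\n" + continuation).strip()
        if extract_answer(candidate) != gold_answer:
            return k, candidate               # wrong answer: usable dispreferred sample
    return None
```

The resulting pairs of a correct (preferred) solution and a step-controlled wrong (dispreferred) solution are then used for DPO-style preference training, with the known error step providing the stepwise supervision the summary describes.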