Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning

15 Jul 2024 | Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan†, Hongsheng Li†
The paper introduces Step-Controlled DPO (SCDPO), a method that enhances the mathematical reasoning abilities of large language models (LLMs) through stepwise error supervision. SCDPO generates negative samples of mathematical reasoning rationales that begin making errors at a specified step; these are then used in DPO training to sharpen the model's understanding of reasoning errors and help it produce accurate reasoning steps. The method is applied to both code-integrated and chain-of-thought solutions, showing consistent improvements over naive DPO on three different SFT models. Qualitative analysis demonstrates SCDPO's effectiveness at identifying errors in mathematical solutions. Applied to an InternLM2-20B model, the method achieves scores of 88.5% on GSM8K and 58.1% on MATH, rivaling other open-source LLMs. The paper also discusses the theoretical insights behind SCDPO and its limitations, highlighting directions for future work in multimodal reasoning and pure code solutions.
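To make the pipeline concrete, below is a minimal Python sketch of the two ingredients the summary describes: producing a negative rationale whose first error appears at a chosen step k, and the standard DPO loss applied to the resulting preference pairs. The `generate` and `is_correct` callables, the elevated-temperature resampling, and all parameter values are illustrative assumptions, not the paper's released implementation.

```python
# A minimal SCDPO sketch. generate() and is_correct() are hypothetical
# helpers standing in for a sampler and an answer checker.
from typing import Callable, List, Optional

import torch
import torch.nn.functional as F


def make_step_controlled_negative(
    generate: Callable[[str, float], str],   # (prompt, temperature) -> completion
    is_correct: Callable[[str, str], bool],  # (problem, solution) -> answer matches
    problem: str,
    correct_steps: List[str],
    k: int,
    temperature: float = 1.1,                # assumed value for error-inducing sampling
    max_tries: int = 8,
) -> Optional[str]:
    """Keep the first k steps of a verified-correct rationale, then resample
    the continuation (here at elevated temperature, an assumed detail) until
    the final answer is wrong, yielding a negative sample whose first error
    occurs at or after step k."""
    prefix = "\n".join(correct_steps[:k])
    for _ in range(max_tries):
        candidate = prefix + "\n" + generate(problem + "\n" + prefix, temperature)
        if not is_correct(problem, candidate):
            return candidate  # step-controlled negative sample
    return None  # failed to induce an error from this prefix


def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective; SCDPO pairs each correct rationale (chosen)
    with a step-controlled negative (rejected)."""
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen - rejected).mean()
```

Because the prefix up to step k is copied from a solution already verified correct, the pair isolates where the rejected rationale diverges, which is the stepwise error signal that distinguishes SCDPO from naive DPO over whole solutions.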