Step-level Value Preference Optimization for Mathematical Reasoning


16 Jun 2024 | Guoxin Chen, Minpeng Liao, Chengxi Li, Kai Fan
The paper introduces a novel algorithm called Step-level Value Preference Optimization (SVPO) to enhance the mathematical reasoning capabilities of large language models (LLMs). SVPO addresses a limitation of Direct Preference Optimization (DPO), which uses an implicit reward model to fine-tune LLMs from preference feedback: while effective, DPO does not capture the fine-grained quality of model outputs in complex multi-step reasoning tasks. SVPO employs Monte Carlo Tree Search (MCTS) to automatically annotate step-level preferences, providing more detailed signals about where reasoning goes wrong. Additionally, an explicit value model is trained to complement standard preference optimization, enabling the LLM to generate higher-reward responses at minimal cost during inference. Experimental results demonstrate that SVPO achieves state-of-the-art performance on both in-domain and out-of-domain mathematical reasoning benchmarks, outperforming existing methods and even GPT-4 on challenging datasets. The approach is computationally efficient and integrates well with step-level beam search for effective reasoning. The paper also discusses the impact of hyperparameters and the effectiveness of the value model in preference learning.
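
To make the inference-time idea concrete, the sketch below illustrates how an explicit value model can guide step-level beam search, as the summary describes at a high level. It is a minimal illustration, not the authors' implementation: the helper callables `generate_step` and `value_model`, the `[END]` stop marker, and all parameter names are assumptions introduced here for clarity.

```python
# Hypothetical sketch: value-guided step-level beam search.
# `generate_step(question, steps, k)` is assumed to propose k candidate next
# reasoning steps; `value_model(question, steps)` is assumed to score a
# partial solution. Both are placeholders, not the paper's actual API.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Candidate:
    steps: List[str] = field(default_factory=list)  # reasoning steps so far
    score: float = 0.0                               # value-model score
    done: bool = False                               # reached a final answer


def step_level_beam_search(
    question: str,
    generate_step: Callable[[str, List[str], int], List[str]],
    value_model: Callable[[str, List[str]], float],
    beam_width: int = 3,
    expand_per_beam: int = 3,
    max_steps: int = 10,
) -> Candidate:
    """Keep the top-`beam_width` partial solutions, extending each by one
    reasoning step per iteration and ranking candidates with the value model."""
    beams = [Candidate()]
    for _ in range(max_steps):
        expanded: List[Candidate] = []
        for cand in beams:
            if cand.done:
                expanded.append(cand)
                continue
            for step in generate_step(question, cand.steps, expand_per_beam):
                new_steps = cand.steps + [step]
                expanded.append(
                    Candidate(
                        steps=new_steps,
                        score=value_model(question, new_steps),
                        # "[END]" is an assumed marker for a completed solution.
                        done=step.strip().endswith("[END]"),
                    )
                )
        # Prune to the highest-valued partial solutions.
        beams = sorted(expanded, key=lambda c: c.score, reverse=True)[:beam_width]
        if all(c.done for c in beams):
            break
    return beams[0]
```

Because only a small number of partial solutions are expanded and scored at each step, this kind of search adds little overhead at inference time, which is consistent with the summary's claim that the value model enables higher-reward responses at minimal cost.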