16 Jun 2024 | Guoxin Chen, Minpeng Liao, Chengxi Li, Kai Fan*
Step-level Value Preference Optimization (SVPO) is a novel approach for improving mathematical reasoning in large language models (LLMs). It addresses the limitations of existing preference optimization methods by using Monte Carlo Tree Search (MCTS) to annotate preferences at the step level. Unlike traditional methods that rely on solution-level preferences, SVPO produces fine-grained preferences for each step of the reasoning process, which lets the model identify and correct reasoning errors during inference.

SVPO also integrates an explicit value model that complements the implicit reward model of Direct Preference Optimization (DPO), enabling the LLM to generate higher-reward responses at minimal extra cost. The value model is trained on both the Q-values estimated by MCTS and the step-level preference relationships, which aligns the model's preferences with its actual reasoning capabilities. Experimental results show that SVPO achieves state-of-the-art performance on both in-domain and out-of-domain mathematical reasoning benchmarks, with 7B LLMs outperforming existing methods, including GPT-4. The approach is computationally efficient and demonstrates that step-level preference learning can substantially enhance the reasoning capabilities of LLMs.
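To make the training objective concrete, below is a minimal sketch of an SVPO-style loss in PyTorch. This is not the authors' implementation: the function name, hyperparameters (`beta`, `margin`, `lam`), and tensor shapes are assumptions for illustration. It pairs a step-level DPO preference term with a value-model loss that regresses toward MCTS Q-values and enforces the step-level preference ordering with a margin.

```python
# Hypothetical sketch of an SVPO-style objective (not the paper's released code).
# Assumes each training example is a (preferred step, dispreferred step) pair
# sharing the same reasoning prefix, with MCTS Q-value targets for both steps.
import torch
import torch.nn.functional as F


def svpo_loss(
    policy_logp_chosen: torch.Tensor,    # log p_theta(preferred step | prefix), shape (B,)
    policy_logp_rejected: torch.Tensor,  # log p_theta(dispreferred step | prefix), shape (B,)
    ref_logp_chosen: torch.Tensor,       # same quantities under the frozen reference model
    ref_logp_rejected: torch.Tensor,
    value_chosen: torch.Tensor,          # value-model scores for the preferred steps, shape (B,)
    value_rejected: torch.Tensor,        # value-model scores for the dispreferred steps, shape (B,)
    q_chosen: torch.Tensor,              # MCTS Q-value targets for the preferred steps, shape (B,)
    q_rejected: torch.Tensor,            # MCTS Q-value targets for the dispreferred steps, shape (B,)
    beta: float = 0.1,                   # DPO temperature (assumed hyperparameter)
    margin: float = 0.5,                 # ranking margin between preferred/dispreferred values (assumed)
    lam: float = 1.0,                    # weight of the value-model loss (assumed)
) -> torch.Tensor:
    # Step-level DPO term: push the policy's implicit reward of the preferred
    # step above that of the dispreferred step at the same reasoning position.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    preference_loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

    # Explicit value model: regress its scores toward the MCTS Q-values ...
    regression_loss = F.mse_loss(value_chosen, q_chosen) + F.mse_loss(value_rejected, q_rejected)
    # ... and make them respect the step-level preference ordering.
    ranking_loss = F.relu(margin - (value_chosen - value_rejected)).mean()

    return preference_loss + lam * (regression_loss + ranking_loss)
```

In this sketch, the regression term anchors the value model to the MCTS estimates, while the margin term enforces the step-level ordering from the preference annotations; how the two value-model terms and the preference term are weighted (`lam`) would be a tuning choice rather than something specified here.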