17 Jun 2024 | Yuxi Xie, Anirudh Goyal, Wen Yue Zheng, Min-Yen Kan, Timothy Lillicrap, Kenji Kawaguchi, Michael Shieh
This paper introduces an approach to enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process inspired by AlphaZero. The method uses Monte Carlo Tree Search (MCTS) to collect preference data, breaking instance-level rewards down into more granular step-level signals, and it combines outcome validation with stepwise self-evaluation to continuously refine the quality estimates of newly generated data. Because reasoning sequences are generated on the fly from the current policy during training and labeled via MCTS and self-evaluation, the model produces its own preference data rather than relying on a predetermined human-labeled dataset, yielding real-time, on-policy training signals. A theoretical analysis underscores the importance of this on-policy sampling for successful self-improvement.
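To make the data-collection step concrete, here is a minimal sketch of how step-level preference pairs could be extracted from a search tree. It is an illustration under simplifying assumptions, not the authors' implementation (see the linked repository for that): the hypothetical `generate_fn` stands in for sampling candidate next steps from the current policy, and the reward backed up through the tree is assumed to blend outcome validation with a stepwise self-evaluation score.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class Node:
    """One reasoning step in the search tree (simplified illustration)."""
    state: str                       # prompt plus the steps generated so far
    step: str = ""                   # text of the step that led to this node
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0               # running mean of backed-up rewards

def expand(node: Node, generate_fn: Callable[[str, int], List[str]], k: int = 4) -> None:
    """Sample k candidate next steps from the current policy (on-policy data)."""
    for step in generate_fn(node.state, k):
        node.children.append(Node(state=node.state + step, step=step, parent=node))

def backup(node: Node, reward: float) -> None:
    """Propagate a reward (e.g. outcome correctness blended with a
    self-evaluation score) from a leaf back to the root."""
    current: Optional[Node] = node
    while current is not None:
        current.visits += 1
        current.value += (reward - current.value) / current.visits
        current = current.parent

def extract_step_preference(node: Node) -> Optional[Tuple[str, str, str]]:
    """Turn sibling steps at one node into a (prefix, chosen, rejected) pair
    by contrasting the highest- and lowest-valued children."""
    scored = [c for c in node.children if c.visits > 0]
    if len(scored) < 2:
        return None
    best = max(scored, key=lambda c: c.value)
    worst = min(scored, key=lambda c: c.value)
    if best.value <= worst.value:
        return None
    return node.state, best.step, worst.step
```

Node selection (e.g. a UCB-style rule) and the exact reward blend are omitted here; the point is that preferences are defined between sibling steps sharing the same prefix, which is what turns an instance-level outcome into step-level supervision.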
Direct Preference Optimization (DPO) then updates the LLM policy with this step-level preference data, and the algorithm alternates between the two stages: step-level preference sampling via MCTS and preference learning via DPO. Extensive evaluations on arithmetic and commonsense reasoning tasks show substantial gains; for instance, the approach outperforms the Mistral-7B Supervised Fine-Tuning (SFT) baseline on GSM8K, MATH, and ARC-C with accuracy improvements of 5.9%, 5.8%, and 15.8%, respectively. The paper also examines the trade-off between training and inference compute, showing how the method maximizes performance gains. The code is publicly available at https://github.com/YuxiXie/MCTS-DPO.
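For reference, the preference-learning stage reduces to the standard DPO objective applied to these step-level pairs. The sketch below assumes the summed log-probabilities of each chosen and rejected step have already been computed under the current policy and a frozen reference model; `beta` is the usual DPO temperature.

```python
import torch
import torch.nn.functional as F

def step_level_dpo_loss(policy_chosen_logps: torch.Tensor,
                        policy_rejected_logps: torch.Tensor,
                        ref_chosen_logps: torch.Tensor,
                        ref_rejected_logps: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """DPO loss where each (chosen, rejected) pair is a reasoning step sampled
    from the same tree node (shared prefix), rather than a full solution.
    All inputs have shape [batch] and hold summed log-probabilities of the
    step tokens under the current policy and the frozen reference model."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Bradley-Terry style objective: push the policy to prefer the chosen step
    # over the rejected one, relative to the reference model.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```

In the iterative setup described above, this loss would be minimized on the freshly collected pairs, the updated policy would generate the next round of search trees, and the process repeats; the theoretical point is that the pairs must come from the current policy for this loop to keep improving.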