Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning


17 Jun 2024 | Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy Lillicrap, Kenji Kawaguchi, Michael Shieh
The paper introduces an approach to enhance the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process inspired by AlphaZero. The method leverages Monte Carlo Tree Search (MCTS) to collect preference data at a step-level granularity, breaking down instance-level rewards into more granular signals. This approach combines outcome validation and stepwise self-evaluation to ensure consistency in intermediate steps and update the quality assessment of newly generated data. Direct Preference Optimization (DPO) is used to update the LLM policy using the step-level preference data. Theoretical analysis highlights the importance of using on-policy sampled data for successful self-improvement. Extensive evaluations on various arithmetic and commonsense reasoning tasks demonstrate significant performance improvements over existing models, with substantial increases in accuracy on datasets like GSM8K, MATH, and ARC-C. The research also explores the training and inference compute tradeoff, showing how the method effectively maximizes performance gains. The code for the proposed method is publicly available.
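
To make the preference-learning step more concrete, below is a minimal sketch of a standard DPO loss as it might be applied to step-level preference pairs extracted from an MCTS tree. This is an illustration under assumptions, not the paper's implementation (which is available in its public code release): the function name, argument names, and the example batch are all hypothetical, and only the standard DPO objective is shown.

```python
# Minimal sketch: DPO loss over step-level preference pairs (illustrative only).
# Assumes each input tensor holds the summed token log-probabilities of a
# preferred ("chosen") or dispreferred ("rejected") reasoning step, under the
# current policy or a frozen reference model.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective applied to a batch of step-level preference pairs."""
    # Log-ratio of policy to reference for each side of the pair.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # DPO maximizes the margin between the two log-ratios, scaled by beta.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()


if __name__ == "__main__":
    # Dummy log-probabilities for a batch of 4 preference pairs.
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(loss.item())
```

In the paper's pipeline, the chosen and rejected steps for each pair come from sibling nodes in the MCTS tree, ranked by outcome validation and stepwise self-evaluation; the sketch above only covers the policy update once those pairs exist.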