6 Jun 2024 | Dan Zhang, Sining Zhoubian, Yisong Yue, Yuxiao Dong, Jie Tang
The paper introduces ReST-MCTS*, an approach to LLM self-training that integrates process reward guidance with a tree-search algorithm, MCTS*, in order to collect higher-quality reasoning traces and per-step values for training the policy and reward models. Unlike methods that rely on manual annotation of per-step rewards, ReST-MCTS* infers process rewards through tree search by estimating the probability that a given step leads to the correct final answer. These inferred rewards serve two purposes: they provide training targets to refine the process reward model, and they guide the selection of high-quality traces for policy-model self-training. The authors show that the resulting tree-search policy achieves higher accuracy than existing LLM reasoning baselines under the same search budget, and that traces found by this search, when used as training data, continuously improve the language models over multiple self-training iterations, outperforming other self-training algorithms such as ReST^EM and Self-Rewarding. Experiments on several reasoning benchmarks validate the effectiveness of ReST-MCTS* against these baselines.
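To make the core mechanism concrete, below is a minimal Python sketch of value-guided search with inferred per-step rewards. It is an illustration under assumptions, not the paper's exact MCTS* implementation: `policy.propose_steps`, `policy.rollout`, and `is_correct` are hypothetical stand-ins for a step-proposing policy model, a trace-completion rollout, and an answer checker, and the real algorithm additionally uses a learned process reward model with proper MCTS selection and backpropagation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    steps: list                       # partial reasoning trace (list of step strings)
    value: float = 0.0                # inferred process reward: est. P(correct | steps)
    children: list = field(default_factory=list)

def estimate_value(policy, question, steps, reference, is_correct, n_rollouts=4):
    """Estimate the probability that this partial trace reaches the correct
    answer by completing it several times and checking the final answers."""
    hits = 0
    for _ in range(n_rollouts):
        answer = policy.rollout(question, steps)   # complete the partial trace
        hits += int(is_correct(answer, reference))
    return hits / n_rollouts

def search(policy, question, reference, is_correct, width=3, depth=6):
    """Greedy value-guided tree search: expand the most promising partial
    trace, score each candidate step by its inferred process reward, and
    collect (partial trace, value) pairs for reward-model training."""
    root = Node(steps=[])
    value_data = []                    # per-step supervision for the reward model
    node = root
    for _ in range(depth):
        candidates = policy.propose_steps(question, node.steps, k=width)
        for step in candidates:
            child = Node(steps=node.steps + [step])
            child.value = estimate_value(policy, question, child.steps,
                                         reference, is_correct)
            node.children.append(child)
            value_data.append((child.steps, child.value))
        node = max(node.children, key=lambda c: c.value)   # follow the best step
        if node.value == 1.0:          # every rollout was correct: stop early
            break
    return node.steps, value_data      # high-quality trace + reward-model data
```

In a self-training loop of this kind, the returned high-value traces would be used to fine-tune the policy model, while the collected (partial trace, value) pairs would train the process reward model that guides the next round of search.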