2024-05-22 | Liangchen Luo¹, Yinxiao Liu¹, Rosanne Liu¹, Samrat Phatale¹, Harsh Lara¹, Yunxuan Li², Lei Shu¹, Yun Zhu¹, Lei Meng², Jiao Sun² and Abhinav Rastogi¹
This paper introduces OmegaPRM, a novel divide-and-conquer Monte Carlo Tree Search (MCTS) algorithm designed to efficiently collect high-quality process supervision data for large language models (LLMs). The key challenge in improving LLM reasoning is the scarcity of effective process supervision data, which is typically costly to collect through human annotation or per-step Monte Carlo estimation. OmegaPRM addresses this by using binary search to quickly identify the first error in a Chain of Thought (CoT) solution and by balancing positive and negative examples, ensuring both efficiency and data quality. The approach yields over 1.5 million process supervision annotations, which are used to train a Process Reward Model (PRM) that provides granular feedback on intermediate reasoning steps. Combined with a weighted self-consistency algorithm, the method significantly improves the math reasoning performance of the Gemini Pro model, achieving a 69.4% success rate on the MATH benchmark, a 36% relative improvement over the base model. The entire pipeline operates without human intervention, making it both financially and computationally cost-effective.

The paper also discusses the limitations of current methods, including the noise introduced by automated process supervision and the continued need for human supervision on certain tasks. Overall, OmegaPRM represents a significant advancement in LLM reasoning by providing an efficient and effective way to collect process supervision data.
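As a concrete illustration of the divide-and-conquer idea, below is a minimal sketch of the binary-search error localization, assuming step correctness is judged by Monte Carlo rollouts and that a prefix's value is roughly monotone (once a wrong step appears, longer prefixes rarely recover the correct answer). The callables `sample_completions` and `is_correct` are hypothetical stand-ins for a policy-model sampler and a final-answer checker; the full OmegaPRM algorithm embeds this search inside an MCTS over solution prefixes rather than running it in isolation.

```python
from typing import Callable, List


def mc_value(question: str,
             prefix_steps: List[str],
             sample_completions: Callable[[str, List[str], int], List[str]],
             is_correct: Callable[[str], bool],
             num_rollouts: int = 8) -> float:
    """Monte Carlo value of a solution prefix: the fraction of rollouts
    continued from this prefix that reach the correct final answer."""
    rollouts = sample_completions(question, prefix_steps, num_rollouts)
    return sum(is_correct(r) for r in rollouts) / max(len(rollouts), 1)


def locate_first_error(question: str,
                       steps: List[str],
                       sample_completions: Callable[[str, List[str], int], List[str]],
                       is_correct: Callable[[str], bool],
                       threshold: float = 0.0) -> int:
    """Binary-search the index of the first erroneous step in a CoT solution.

    Returns the index of the first step whose inclusion drops the prefix's
    Monte Carlo value to `threshold` or below, or len(steps) if every prefix
    can still reach the correct answer.
    """
    lo, hi = 1, len(steps)            # candidate prefix lengths
    first_bad_len = None
    while lo <= hi:
        mid = (lo + hi) // 2
        value = mc_value(question, steps[:mid], sample_completions, is_correct)
        if value <= threshold:        # prefix of length `mid` can no longer recover
            first_bad_len = mid
            hi = mid - 1              # the first error is at or before step `mid`
        else:
            lo = mid + 1              # prefix still fine; any error lies later
    return first_bad_len - 1 if first_bad_len is not None else len(steps)
```

Under the monotonicity assumption this needs only O(log n) Monte Carlo evaluations per solution instead of one per step, which is where the cost saving over naive per-step estimation comes from.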
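At inference time, the trained PRM is combined with weighted self-consistency: sampled solutions vote for their final answers, with each vote weighted by the PRM's score for that solution. A minimal sketch, assuming each solution's per-step rewards have already been reduced to a single scalar (the exact aggregation is not specified in this summary):

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_self_consistency(candidates: List[Tuple[str, float]]) -> str:
    """Pick a final answer by PRM-weighted voting.

    `candidates` holds one (final_answer, prm_score) pair per sampled solution.
    Plain self-consistency is the special case where every score is 1.0.
    """
    totals: Dict[str, float] = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)


# Hypothetical example: three sampled solutions, two agreeing on "42".
print(weighted_self_consistency([("42", 0.9), ("42", 0.7), ("7", 0.95)]))  # -> 42
```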