2024-05-22 | Liangchen Luo¹, Yinxiao Liu¹, Rosanne Liu¹, Samrat Phatale¹, Harsh Lara¹, Yunxuan Li², Lei Shu¹, Yun Zhu¹, Lei Meng², Jiao Sun² and Abhinav Rastogi¹
This paper introduces OmegaPRM, a novel divide-and-conquer Monte Carlo Tree Search (MCTS) algorithm designed to efficiently collect high-quality process supervision data for large language models (LLMs). The key challenge in improving LLM reasoning is the scarcity of effective process supervision data, which is typically costly to collect through human annotation or per-step Monte Carlo estimation. OmegaPRM addresses this by using binary search to quickly identify the first error in a Chain of Thought (CoT) solution and by balancing positive and negative examples, ensuring both efficiency and data quality. The approach yields over 1.5 million process supervision annotations, which are used to train a Process Reward Model (PRM) that provides granular feedback on intermediate reasoning steps. Combined with a weighted self-consistency algorithm, the method significantly improves the math reasoning performance of the Gemini Pro model, achieving a 69.4% success rate on the MATH benchmark, a 36% relative improvement over the base model. The entire pipeline operates without human intervention, making it both financially and computationally cost-effective.

The paper also discusses the limitations of current methods, including the noise introduced by automated process supervision and the continued need for human supervision on certain tasks. Overall, OmegaPRM represents a significant advancement in LLM reasoning by providing an efficient and effective way to collect process supervision data.
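As a concrete illustration of the divide-and-conquer idea, below is a minimal sketch of the binary-search error localization, assuming step correctness is judged by Monte Carlo rollouts and that a prefix's value is roughly monotone (once a wrong step appears, longer prefixes rarely recover the correct answer). The callables `sample_completions` and `is_correct` are hypothetical stand-ins for a policy-model sampler and a final-answer checker; the full OmegaPRM algorithm embeds this search inside an MCTS over solution prefixes rather than running it in isolation.

```python
from typing import Callable, List


def mc_value(question: str,
             prefix_steps: List[str],
             sample_completions: Callable[[str, List[str], int], List[str]],
             is_correct: Callable[[str], bool],
             num_rollouts: int = 8) -> float:
    """Monte Carlo value of a solution prefix: the fraction of rollouts
    continued from this prefix that reach the correct final answer."""
    rollouts = sample_completions(question, prefix_steps, num_rollouts)
    return sum(is_correct(r) for r in rollouts) / max(len(rollouts), 1)


def locate_first_error(question: str,
                       steps: List[str],
                       sample_completions: Callable[[str, List[str], int], List[str]],
                       is_correct: Callable[[str], bool],
                       threshold: float = 0.0) -> int:
    """Binary-search the index of the first erroneous step in a CoT solution.

    Returns the index of the first step whose inclusion drops the prefix's
    Monte Carlo value to `threshold` or below, or len(steps) if every prefix
    can still reach the correct answer.
    """
    lo, hi = 1, len(steps)            # candidate prefix lengths
    first_bad_len = None
    while lo <= hi:
        mid = (lo + hi) // 2
        value = mc_value(question, steps[:mid], sample_completions, is_correct)
        if value <= threshold:        # prefix of length `mid` can no longer recover
            first_bad_len = mid
            hi = mid - 1              # the first error is at or before step `mid`
        else:
            lo = mid + 1              # prefix still fine; any error lies later
    return first_bad_len - 1 if first_bad_len is not None else len(steps)
```

Under the monotonicity assumption this needs only O(log n) Monte Carlo evaluations per solution instead of one per step, which is where the cost saving over naive per-step estimation comes from.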
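At inference time, the trained PRM is combined with weighted self-consistency: sampled solutions vote for their final answers, with each vote weighted by the PRM's score for that solution. A minimal sketch, assuming each solution's per-step rewards have already been reduced to a single scalar (the exact aggregation is not specified in this summary):

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_self_consistency(candidates: List[Tuple[str, float]]) -> str:
    """Pick a final answer by PRM-weighted voting.

    `candidates` holds one (final_answer, prm_score) pair per sampled solution.
    Plain self-consistency is the special case where every score is 1.0.
    """
    totals: Dict[str, float] = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)


# Hypothetical example: three sampled solutions, two agreeing on "42".
print(weighted_self_consistency([("42", 0.9), ("42", 0.7), ("7", 0.95)]))  # -> 42
```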