AutoPRM: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition

18 Feb 2024 | Zhaorun Chen1, Zhuokai Zhao1, Zhihong Zhu2, Ruiqi Zhang3, Xiang Li4, Bhiksha Raj4, Huaxiu Yao5
AutoPRM is a novel self-supervised framework designed to enhance the fine-tuning of large language models (LLMs) for complex reasoning tasks. The framework addresses the challenge of extensive manual labeling required for procedural feedback by automating the process of question decomposition and using reinforcement learning to iteratively improve the subquestion solver. Specifically, AutoPRM decomposes complex problems into manageable subquestions with controllable granularity, and then applies reinforcement learning to optimize the subquestion solver. Additionally, it introduces context-guided decoding to avoid reward tampering and guide the subquestion solver towards solving the holistic problem. Extensive experiments on arithmetic and commonsense reasoning datasets demonstrate that AutoPRM significantly improves performance over state-of-the-art (SOTA) methods, while being more efficient and scalable. The framework can be easily integrated with other reasoning pipelines, making it a versatile tool for enhancing LLMs' reasoning capabilities.
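The decompose-then-solve pipeline described above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: `decompose` and `solve_step` are hypothetical stand-ins for AutoPRM's learned subquestion generator and RL-tuned solver, and context-guided decoding is approximated by conditioning each step on the previously solved subquestions.

```python
# Hypothetical sketch of AutoPRM's decompose-then-solve loop.
# decompose() and solve_step() are stand-ins for the paper's learned
# subquestion generator and RL-optimized subquestion solver.

def decompose(question, granularity=2):
    """Stand-in subquestion generator: splits a compound question into
    `granularity` subquestions (a real system would use an LLM, with
    `granularity` controlling how fine the decomposition is)."""
    return [f"{question} [subquestion {i + 1}]" for i in range(granularity)]

def solve_step(subquestion, context):
    """Stand-in solver: context-guided decoding is approximated here by
    conditioning each answer on all previously solved steps."""
    return f"answer({subquestion} | {len(context)} prior steps)"

def solve(question, granularity=2):
    """Solve a complex question by decomposing it and answering each
    subquestion in order, carrying the accumulated context forward."""
    context = []
    for sub in decompose(question, granularity):
        answer = solve_step(sub, context)
        context.append((sub, answer))  # context guides the next step
    return context

trace = solve("If Alice has 3 apples and buys 4 more, how many total?",
              granularity=2)
for sub, answer in trace:
    print(answer)
```

The key design point the sketch captures is that each subquestion is answered with the prior steps in scope, which is what keeps the solver anchored to the holistic problem rather than optimizing each subquestion in isolation.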