Feb 5, 2024 | Zihan Wang, Yunxuan Li, Yuexin Wu, Liangchen Luo, Le Hou, Hongkun Yu, Jingbo Shang
This paper introduces Model-induced Process Supervision (MiPS), a novel method for automatically curating process supervision data to train verifiers that evaluate intermediate steps in multi-step problem solving. Traditional process supervision requires human annotation, which is costly and time-consuming. MiPS instead uses a reasoning model to generate multiple solutions for each problem, then evaluates the correctness of each intermediate step by sampling completions of the solution from that step onward. The accuracy of an intermediate step is defined as the proportion of sampled completions that reach a correct final answer, and these accuracies serve as step-wise labels. This allows process supervision data to be generated automatically, without human intervention.
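A minimal sketch of this labeling procedure may help make it concrete. The callables `sample_solution`, `complete_from`, and `is_correct` below are hypothetical stand-ins for the reasoning model's decoding interface and the answer checker, not the authors' actual code:

```python
from typing import Callable, List, Tuple

def mips_labels(
    problem: str,
    sample_solution: Callable[[str], List[str]],     # problem -> list of reasoning steps
    complete_from: Callable[[str, List[str]], str],  # (problem, step prefix) -> final answer
    is_correct: Callable[[str, str], bool],          # (problem, answer) -> correct?
    n_solutions: int = 8,
    n_completions: int = 8,
) -> List[Tuple[List[str], List[float]]]:
    """Score each intermediate step by the fraction of sampled
    completions (from that step's prefix) that reach a correct answer."""
    data = []
    for _ in range(n_solutions):
        steps = sample_solution(problem)
        labels = []
        for k in range(1, len(steps) + 1):
            prefix = steps[:k]
            # Sample completions of the partial solution and check the
            # final answers; the step's label is the fraction correct.
            answers = [complete_from(problem, prefix) for _ in range(n_completions)]
            acc = sum(is_correct(problem, a) for a in answers) / n_completions
            labels.append(acc)
        data.append((steps, labels))
    return data
```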
The paper evaluates MiPS on math and coding tasks with PaLM 2, showing that a verifier trained on MiPS data significantly outperforms an output-supervised verifier: accuracy improves by +0.67% on GSM8K, +4.16% on MATH, and +0.92% on MBPP. The study also demonstrates that verifiers trained on MiPS data generalize well across different reasoning models.
The paper analyzes the effectiveness of different aggregation functions for combining step-wise predictions into a single solution-level score. It finds that aggregation functions emphasizing high predicted scores outperform those emphasizing low predicted scores, contrary to prior work. It also shows that using only the last step's predicted score can sometimes be more effective than using output supervision.
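To make these aggregation choices concrete, here is an illustrative helper. The names mirror common choices (min, max, product, last step) rather than the paper's exact definitions:

```python
import math
from typing import List

def aggregate(step_scores: List[float], how: str = "last") -> float:
    """Combine per-step verifier scores into one solution-level score."""
    if how == "min":      # pessimistic: dominated by the weakest step
        return min(step_scores)
    if how == "max":      # optimistic: dominated by the strongest step
        return max(step_scores)
    if how == "product":  # treats steps as independent success probabilities
        return math.prod(step_scores)
    if how == "last":     # use only the final step's prediction
        return step_scores[-1]
    raise ValueError(f"unknown aggregation: {how}")

# e.g. aggregate([0.9, 0.7, 0.95], how="last") -> 0.95
```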
The paper further explores the impact of noise in MiPS data on the performance of process-supervised verifiers. It finds that labels for earlier steps are noisier, which can hurt verifier performance; using only the last step's predicted score mitigates this issue.
Finally, the paper shows that verifiers trained on MiPS data can transfer to validate solutions generated by different reasoning models, indicating that MiPS data does not produce verifiers that are overly biased towards the mistakes of the reasoning model that generated the data. The study concludes that MiPS provides a scalable and effective way to generate process supervision data for training verifiers that can improve the performance of multi-step problem solving.