V-STaR: Training Verifiers for Self-Taught Reasoners


2024 | Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, Rishabh Agarwal
This paper proposes V-STaR, a method that improves the reasoning ability of large language models (LLMs) by training both a generator and a verifier on correct and incorrect solutions generated during iterative self-improvement. In each iteration, the generator is fine-tuned on its own correct solutions, while the verifier is trained on both correct and incorrect solutions using Direct Preference Optimization (DPO), allowing it to learn from the generator's mistakes. At inference time, the verifier ranks multiple candidate solutions from the generator and selects the best one. Repeating this generate-filter-train cycle yields progressively better generators and verifiers; a sketch of the loop follows.
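The code below is a minimal sketch of this procedure, not the authors' implementation; the function name v_star and the callables sample, check_correct, sft_finetune, and dpo_finetune are hypothetical placeholders standing in for sampling from the model, checking solutions against ground-truth answers or test cases, supervised fine-tuning, and DPO training, respectively.

```python
# Hypothetical sketch of the V-STaR training loop (all callables are placeholders).
def v_star(base_model, problems, sample, check_correct,
           sft_finetune, dpo_finetune, num_iters=3, k=16):
    generator = base_model
    correct_pool = []       # (problem, solution) pairs for generator fine-tuning
    preference_pairs = []   # (problem, correct, incorrect) triples for verifier DPO
    for _ in range(num_iters):
        for problem in problems:
            solutions = [sample(generator, problem) for _ in range(k)]
            good = [s for s in solutions if check_correct(problem, s)]
            bad = [s for s in solutions if not check_correct(problem, s)]
            correct_pool += [(problem, s) for s in good]
            preference_pairs += [(problem, g, b) for g in good for b in bad]
        # STaR-style: re-finetune from the base model on all correct data so far.
        generator = sft_finetune(base_model, correct_pool)
    # Verifier: DPO on correct-vs-incorrect pairs, preferring correct solutions.
    verifier = dpo_finetune(base_model, preference_pairs)
    return generator, verifier
```

At test time, one would draw several candidates from the final generator and return the candidate with the highest verifier score.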
Evaluated on two widely used benchmarks, GSM8K for math word problems and MBPP for code generation, V-STaR outperforms existing self-improvement and verification approaches, with improvements of up to 17% on math reasoning and 12% on code generation. The method also transfers out of domain, for example to code generation on the HumanEval dataset.

The paper further compares V-STaR with other self-improvement and verification methods, including STaR, rejection fine-tuning (RFT), and outcome reward models (ORMs), and finds that DPO-based verifiers are more effective than ORM-based ones. It also proposes a metric, analogous to Pass@k, for evaluating test performance with verification; a sketch of such an estimator appears below.

The authors conclude that V-STaR is a data-efficient and effective approach to improving LLM reasoning: by exploiting both the correct and incorrect solutions produced during self-improvement to train a verifier that selects among candidate solutions at inference time, it achieves significant gains in test accuracy on common math reasoning and code generation benchmarks.
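The summary does not reproduce the paper's exact formula, so the following is a hedged illustration of one natural estimator in the spirit of Pass@k: given n sampled solutions ranked by verifier score, it computes the exact average, over all size-k subsets, of whether the highest-scored solution in the subset is correct. It relies on the fact that the i-th ranked sample (best first) is the top of a uniformly random size-k subset with probability C(n-i, k-1) / C(n, k).

```python
from math import comb

def best_of_k(scores, correct, k):
    """Estimate best-of-k accuracy from n >= k verifier-scored samples.

    scores  : verifier scores for the n candidate solutions
    correct : booleans, whether each candidate is actually correct
    k       : number of candidates the verifier chooses among
    """
    n = len(scores)
    ranked = sorted(zip(scores, correct), key=lambda t: -t[0])  # best first
    total = 0.0
    for i, (_, is_correct) in enumerate(ranked, start=1):
        # The i-th ranked sample tops a random size-k subset iff it is drawn
        # and the remaining k-1 samples come from the n-i ranked below it.
        total += is_correct * comb(n - i, k - 1) / comb(n, k)
    return total

# Example: three candidates; the verifier's top-scored choice is wrong,
# so only 1 of the 3 possible size-2 subsets yields a correct selection.
print(best_of_k([0.9, 0.7, 0.2], [False, True, True], k=2))  # 0.333...
```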