14 Aug 2024 | Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, Rishabh Agarwal
V-STaR (Verification for Self-Taught Reasoners) is an approach that enhances the reasoning capabilities of large language models (LLMs) by using both the correct and the incorrect solutions generated during the self-improvement process. Unlike existing self-improvement methods, which discard incorrect solutions, V-STaR trains a verifier with Direct Preference Optimization (DPO) to judge the correctness of model-generated solutions. At inference time, this verifier selects the best solution among multiple candidates. The iterative training process improves both the generator and the verifier, leading to significant performance gains on common code generation and math reasoning benchmarks.
Empirical results show that V-STaR achieves a 4% to 17% improvement in test accuracy over existing self-improvement and verification approaches, demonstrating its effectiveness in improving LLMs' reasoning abilities. The key contributions of V-STaR are its use of all generated solutions, including incorrect ones, and its superior performance compared to other self-improvement and verification methods.
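The pipeline described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The helper names (`collect_solutions`, `build_preference_pairs`, `dpo_loss`, `select_best`) and the `sample`, `is_correct`, and `score` callables are hypothetical. Pairing every correct solution with every incorrect one for the same problem is likewise an assumption made for the example. `dpo_loss` is the standard per-pair DPO objective, and `beta=0.1` is an arbitrary illustrative value.

```python
import itertools
import math
from typing import Callable, Dict, List, Tuple


def collect_solutions(
    problems: List[str],
    sample: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    k: int,
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Sample k candidates per problem and keep both correct and incorrect ones."""
    correct: Dict[str, List[str]] = {p: [] for p in problems}
    incorrect: Dict[str, List[str]] = {p: [] for p in problems}
    for p in problems:
        for _ in range(k):
            s = sample(p)  # hypothetical generator call
            if is_correct(p, s):  # e.g. answer match or unit tests
                correct[p].append(s)
            else:
                incorrect[p].append(s)
    return correct, incorrect


def build_preference_pairs(
    correct: Dict[str, List[str]],
    incorrect: Dict[str, List[str]],
) -> List[Tuple[str, str, str]]:
    """Build (problem, chosen, rejected) triples: correct preferred over incorrect."""
    pairs = []
    for p in correct:
        for good, bad in itertools.product(correct[p], incorrect.get(p, [])):
            pairs.append((p, good, bad))
    return pairs


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from the source
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * difference of log-ratios))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def select_best(
    problem: str,
    candidates: List[str],
    score: Callable[[str, str], float],
) -> str:
    """Best-of-k selection: return the candidate the verifier scores highest."""
    return max(candidates, key=lambda s: score(problem, s))
```

In a real setup, `sample` would call the fine-tuned generator, `is_correct` would check answers or run test cases, and `score` would be derived from the DPO-trained verifier. The loss is averaged over the preference pairs during verifier training, and `select_best` is applied only at inference time.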