Training Verifiers to Solve Math Word Problems

18 Nov 2021 | Karl Cobbe*, Vineet Kosaraju*, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman
This paper introduces GSM8K, a dataset of 8,500 high-quality, linguistically diverse grade school math word problems. The dataset is designed to have high linguistic diversity while relying on relatively simple grade school math concepts: problems require between 2 and 8 steps to solve, solutions primarily involve a sequence of elementary calculations using basic arithmetic operations, and a bright middle school student should be able to solve every problem. Despite this conceptual simplicity, the authors find that even the largest transformer models struggle to perform well on the dataset.

To improve performance, they propose training verifiers to judge the correctness of model-generated solutions. At test time, they generate multiple candidate solutions per problem and select the one ranked highest by the verifier, as sketched below.
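The following is a minimal sketch of that test-time selection procedure. The `generate_solution` and `verifier_score` callables are hypothetical placeholders for the fine-tuned generator and trained verifier; only the sample-score-select loop reflects the approach described above.

```python
from typing import Callable, List, Tuple


def solve_with_verifier(
    problem: str,
    generate_solution: Callable[[str], str],
    verifier_score: Callable[[str, str], float],
    num_samples: int = 100,
) -> Tuple[str, float]:
    """Sample candidate solutions and return the one the verifier ranks highest."""
    # Draw several independent completions for the same problem.
    candidates: List[str] = [generate_solution(problem) for _ in range(num_samples)]
    # Score each candidate with the verifier.
    scored = [(sol, verifier_score(problem, sol)) for sol in candidates]
    # Highest verifier score wins; ties are broken arbitrarily.
    return max(scored, key=lambda pair: pair[1])
```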
The results show that verification significantly improves performance on GSM8K and scales more effectively with increased data than a fine-tuning baseline. On the full dataset, a 6B verification model slightly outperforms a fine-tuned 175B model, a boost roughly equivalent to a 30x increase in model size. Token-level verifiers are less prone to overfitting than solution-level verifiers, and all methods benefit from regularization with residual dropout, which acts as a strong regularizer for both fine-tuning and verification.

Verification remains remarkably effective even when the verifier is much smaller than the generator, which suggests that the verifier may often rely on relatively coarse heuristics to discriminate between solutions from a given generator rather than attempting a more thorough form of verification. Performance also improves as the number of completions per problem increases, and majority voting among the top-ranked solutions can improve it further; a sketch of that combination follows.
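Below is a sketch of combining verifier ranking with majority voting: keep the top-k candidates by verifier score, extract each one's final numeric answer, and return the most common answer. The answer-extraction regex assumes solutions end with a line like "#### 42"; the exact format used by the authors may differ, so treat it as an illustrative assumption.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple


def extract_final_answer(solution: str) -> Optional[str]:
    """Pull the final numeric answer from a solution string (format assumed)."""
    match = re.search(r"####\s*(-?[\d,\.]+)", solution)
    return match.group(1).replace(",", "") if match else None


def vote_among_top_ranked(
    scored_candidates: List[Tuple[str, float]], top_k: int = 10
) -> Optional[str]:
    """Majority-vote the final answer among the top-k verifier-ranked solutions."""
    # Rank candidates by verifier score and keep the top k.
    top = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)[:top_k]
    # Extract parseable final answers and return the most common one.
    answers = [a for sol, _ in top if (a := extract_final_answer(sol)) is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
```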