18 Nov 2021 | Karl Cobbe*, Vineet Kosaraju*, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman
The paper "Training Verifiers to Solve Math Word Problems" by OpenAI introduces GSM8K, a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The authors find that even large language models struggle with multi-step mathematical reasoning, despite their impressive performance on other tasks. To address this, they propose training verifiers to judge the correctness of model-generated solutions. At test time, the model generates multiple candidate solutions and selects the one ranked highest by the verifier. The study demonstrates that verification significantly improves performance on GSM8K and scales more effectively with increased data compared to a finetuning baseline.
The main contributions include the creation of GSM8K, showing that verification can boost performance by a factor equivalent to a 30x increase in model size, and demonstrating that dropout acts as a strong regularizer. The paper also explores various ablations and provides insights into the effectiveness of different verification methods and regularization techniques.
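The test-time selection procedure can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate_candidates` and `verifier_score` below are hypothetical stand-ins for sampling solutions from the finetuned generator and scoring them with the trained verifier (which, in the paper, predicts the probability that a solution is correct).

```python
def generate_candidates(problem, n=100):
    # Placeholder: a real system would sample n solutions from the
    # finetuned language model for this problem.
    return [f"candidate solution {i} for: {problem}" for i in range(n)]

def verifier_score(problem, solution):
    # Placeholder: a real verifier would output an estimated
    # probability that `solution` correctly answers `problem`.
    return (len(solution) % 7) / 7.0  # dummy deterministic score

def best_of_n(problem, n=100):
    # Sample n candidate solutions, then return the one the
    # verifier ranks highest.
    candidates = generate_candidates(problem, n)
    return max(candidates, key=lambda s: verifier_score(problem, s))
```

The key idea is that judging correctness is easier than generating a correct solution, so spending test-time compute on many samples plus a ranking step outperforms simply finetuning the generator.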