Small Language Models Need Strong Verifiers to Self-Correct Reasoning


6 Jun 2024 | Yunxiang Zhang, Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang
This paper asks whether small language models (LMs) can self-correct their reasoning with minimal input from stronger models. The authors propose SCORE, a pipeline that prompts a small LM to generate its own self-correction data for training self-refinement abilities: correct solutions guide the model in critiquing its incorrect responses, the critiques are filtered for quality, and the filtered data are used for supervised fine-tuning of a self-correcting reasoner. A sketch of this data-generation loop is given below.
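The following is a minimal sketch of such a data-generation loop, written as an illustration rather than the authors' implementation. The `generate` callable (sampling from the small LM), the prompt formats, and the `extract_final_answer` helper are all assumptions.

```python
# Hypothetical sketch of SCORE-style self-correction data generation.
# All names and prompt formats here are assumptions, not the paper's code.

def extract_final_answer(solution: str) -> str:
    """Pull the final answer from a chain-of-thought solution (assumed format)."""
    return solution.rsplit("The answer is", 1)[-1].strip(" .")

def build_score_data(problems, generate, num_samples=8):
    """Build supervised fine-tuning examples that teach a small LM to
    critique and revise its own incorrect solutions."""
    sft_examples = []
    for prob in problems:
        # 1. Sample solutions from the small LM and split them into
        #    correct and incorrect by comparing final answers.
        samples = [generate(f"Question: {prob['question']}\nAnswer:")
                   for _ in range(num_samples)]
        correct = [s for s in samples
                   if extract_final_answer(s) == prob["gold_answer"]]
        wrong = [s for s in samples
                 if extract_final_answer(s) != prob["gold_answer"]]
        if not correct or not wrong:
            continue
        for bad in wrong:
            # 2. Use a correct solution to guide the critique of an incorrect one.
            critique = generate(
                f"Question: {prob['question']}\n"
                f"Incorrect solution: {bad}\n"
                f"Reference solution: {correct[0]}\n"
                "Explain the error in the incorrect solution, "
                "then give a corrected solution:"
            )
            # 3. Filter: keep only critiques whose revised answer is correct.
            if extract_final_answer(critique) == prob["gold_answer"]:
                sft_examples.append({
                    "prompt": (f"Question: {prob['question']}\n"
                               f"Draft: {bad}\nCritique and revise:"),
                    "completion": critique,
                })
    return sft_examples  # training data for the self-correcting refiner
```

Note that the reference solution appears only in the data-generation prompt, not in the fine-tuning prompt, so the trained refiner learns to critique drafts without access to a gold solution.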
Experiments show that SCORE improves the self-correction abilities of two models on five datasets spanning math and commonsense reasoning, with notable gains when the fine-tuned refiner is paired with a strong GPT-4-based verifier; with a weak self-verifier, the models struggle to self-correct. Analysis indicates that self-correction performance is bottlenecked largely by the verifier rather than the refiner, and that the learned self-correction skills transfer across datasets. The main contributions are the SCORE pipeline itself and a demonstration that small LMs can bootstrap self-corrective reasoning without distilling training data from stronger LMs. The paper concludes that strong verifiers are essential for unlocking the self-correction potential of small LMs; a sketch of the resulting verify-then-refine loop at inference time follows.
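At inference time, the verifier decides whether a draft needs revision, and the fine-tuned refiner is invoked only when the draft is flagged as incorrect. A minimal sketch under stated assumptions: `verifier` is a hypothetical callable (e.g., a GPT-4-based judge returning True when a solution looks correct) and `refiner` wraps the SCORE-fine-tuned small LM.

```python
def self_correct(question, refiner, verifier, max_rounds=2):
    """Verify-then-refine loop (assumed interfaces): keep the draft when the
    verifier accepts it; otherwise ask the fine-tuned refiner to revise."""
    draft = refiner(f"Question: {question}\nAnswer:")
    for _ in range(max_rounds):
        if verifier(question, draft):  # strong external verifier, e.g. GPT-4-based
            break
        draft = refiner(f"Question: {question}\nDraft: {draft}\nCritique and revise:")
    return draft
```

Swapping `verifier` for the small LM itself reproduces the weak self-verifier setting, which is where the paper reports self-correction breaking down.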