This paper proposes a novel method called DPO-augmented Self-Training (DPO-ST) to improve the reasoning abilities of small-scale language models (LMs) on mathematical reasoning tasks. The method integrates Direct Preference Optimization (DPO) into the self-training framework, allowing models to learn from preference data constructed from their own generated outputs and thereby improve chain-of-thought reasoning. The approach is evaluated across various mathematical reasoning tasks with different base models, demonstrating significant improvements in reasoning performance while reducing computational costs compared to relying on large proprietary LMs.
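For reference, the DPO objective the method builds on is the standard preference loss of Rafailov et al. (2023), where $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model (typically the SFT checkpoint), $(x, y_w, y_l)$ is a question paired with a preferred and a dispreferred rationale, $\beta$ is a scaling coefficient, and $\sigma$ is the sigmoid function:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$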
The proposed method consists of two main stages: a warm-up stage and an iterative process. In the warm-up stage, the base model is fine-tuned on the labeled data to obtain an initial SFT model. Each iteration then alternates between two sub-steps, a DPO step and an SFT step, as sketched below. In the DPO step, the model is trained with the preference objective so that it can generate higher-quality pseudo-labels. In the SFT step, the model is fine-tuned on these pseudo-labels to further improve its reasoning capabilities.
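A minimal sketch of this loop is given below, assuming hypothetical helper routines (sft_train, dpo_train, build_preference_pairs, sample_solutions, and is_correct) that stand in for the paper's actual training, sampling, and answer-checking procedures; it illustrates the overall structure rather than the authors' implementation.

```python
def dpo_st(base_model, labeled_data, questions, num_iterations=3):
    """Sketch of DPO-augmented Self-Training (DPO-ST).

    All helpers below are hypothetical placeholders:
      sft_train(model, data)              -> supervised fine-tuning
      build_preference_pairs(model, data) -> (question, preferred, dispreferred) triples
      dpo_train(model, pairs)             -> Direct Preference Optimization
      sample_solutions(model, q)          -> sampled chain-of-thought solutions
      is_correct(q, solution)             -> answer-checking filter
    """
    # Warm-up stage: fine-tune the base model on the labeled data.
    model = sft_train(base_model, labeled_data)

    for _ in range(num_iterations):
        # DPO step: build preference pairs from the model's own samples
        # (e.g., correct vs. incorrect rationales) and optimize the
        # preference objective.
        pairs = build_preference_pairs(model, labeled_data)
        dpo_model = dpo_train(model, pairs)

        # Pseudo-labeling: the DPO-tuned model generates rationales;
        # keep only those that pass the correctness filter.
        pseudo_labels = [
            (q, sol)
            for q in questions
            for sol in sample_solutions(dpo_model, q)
            if is_correct(q, sol)
        ]

        # SFT step: fine-tune again on labeled data plus pseudo-labels.
        model = sft_train(base_model, labeled_data + pseudo_labels)

    return model
```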
The method also incorporates an external calculator to enhance the arithmetic capabilities of smaller LMs, allowing them to perform complex calculations more accurately. This is achieved by leveraging the calculator annotations provided in the GSM8K dataset, which are used during the decoding process to trigger the calculator and override the model's output tokens.
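The snippet below illustrates one plausible way such calculator-triggered decoding could work, assuming GSM8K-style annotations of the form <<expression=result>> and a hypothetical generate_next_token API; it is a sketch of the mechanism, not the paper's actual decoding code.

```python
import re

# Triggers when the generated text ends with "<<expression=".
CALC_PATTERN = re.compile(r"<<([^=<>]+)=$")

def decode_with_calculator(model, prompt, max_steps=512):
    """Greedy decoding where an external calculator fills in the result
    of GSM8K-style annotations of the form <<expression=result>>.

    `model.generate_next_token` and `model.eos_token` are hypothetical
    stand-ins for the real decoding interface.
    """
    text = prompt
    for _ in range(max_steps):
        token = model.generate_next_token(text)
        text += token
        match = CALC_PATTERN.search(text)
        if match:
            # The model has just emitted "<<expr=": evaluate expr externally
            # and force the exact result into the output, overriding
            # whatever the model would have generated next.
            try:
                result = eval(match.group(1), {"__builtins__": {}})  # arithmetic only
                text += f"{result}>>"
            except Exception:
                pass  # fall back to the model's own continuation
        if token == model.eos_token:
            break
    return text
```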
Experiments show that the proposed method outperforms existing baselines, including Supervised Fine-Tuning (SFT) and Self-Training (ST), on various mathematical reasoning tasks, with significant gains on both the in-domain benchmark (GSM8K) and out-of-domain benchmarks (MultiArith, ASDiv, and SVAMP). The method is also resource-efficient, requiring modest computational resources and comparatively little human-labeled data.
The paper also discusses the limitations of the proposed method, notably its reliance on suitable unlabeled data and the open question of generalization to other tasks. Future research could explore ways to collect high-quality unlabeled data for math word problem solving and extend the method to a wider range of reasoning tasks. The method is conceptually orthogonal to knowledge distillation, and integrating knowledge distillation into the iterative training process could further enhance model performance.