The paper "Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning" by Tianduo Wang, Shichen Li, and Wei Lu explores methods to enhance the reasoning abilities of small-scale language models (LMs) for mathematical tasks. The authors propose a novel approach that integrates Direct Preference Optimization (DPO) into the self-training framework, aiming to improve the quality and diversity of chain-of-thought (CoT) reasoning.
**Key Contributions:**
1. **DPO-augmented Self-Training (DPO-ST):** This method enhances the traditional self-training process by incorporating DPO, which leverages preference data to guide LMs towards more accurate and diverse CoT reasoning.
2. **Performance Improvement:** The approach significantly improves the reasoning performance of LMs on various mathematical reasoning tasks, including the GSM8K benchmark.
3. **Cost-Effectiveness:** DPO-ST offers a more cost-effective and scalable solution compared to relying on large, proprietary LMs like GPT-4.
**Methodology:**
- **Self-Training Framework:** The method starts with a warm-up stage where the base model is fine-tuned on labeled data. Subsequent iterations involve two steps: DPO and supervised fine-tuning (SFT).
- **DPO Step:** A preference dataset is built by sampling rationales from the SFT model and labeling each one as winning or losing according to whether its final answer is correct. The model is then trained on these pairs to optimize the DPO objective (sketched after this list).
- **SFT Step:** The DPO-tuned model is used to generate pseudo-labeled data, which is combined with the original labeled dataset for further SFT.
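For reference, the DPO step trains on the collected (winning, losing) rationale pairs with the standard DPO loss (Rafailov et al., 2023). In the usual notation, with policy $\pi_\theta$, frozen reference model $\pi_{\mathrm{ref}}$, and a preference pair $(y_w, y_l)$ for question $x$:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

where $\sigma$ is the logistic function and $\beta$ controls how far the policy may drift from the reference model.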
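Putting the two steps together, a minimal sketch of one DPO-ST iteration might look like the following; the helpers (`sample_rationales`, `extract_answer`, `train_dpo`, `train_sft`) are hypothetical placeholders for illustration, not the paper's released code:

```python
# One DPO-augmented self-training iteration (illustrative sketch only).
# All helpers below are hypothetical placeholders, not the paper's implementation.

def dpo_st_iteration(sft_model, labeled_data, num_samples=8):
    # --- DPO step: build a preference dataset from the SFT model's own samples ---
    preference_pairs = []
    for question, gold_answer in labeled_data:
        rationales = sample_rationales(sft_model, question, n=num_samples)
        winners = [r for r in rationales if extract_answer(r) == gold_answer]
        losers = [r for r in rationales if extract_answer(r) != gold_answer]
        # Pair correct with incorrect rationales as (chosen, rejected) examples.
        preference_pairs += [(question, w, l) for w in winners for l in losers]
    dpo_model = train_dpo(sft_model, preference_pairs)  # optimize the DPO loss above

    # --- SFT step: generate pseudo-labels with the DPO-tuned model ---
    pseudo_labels = []
    for question, gold_answer in labeled_data:
        for r in sample_rationales(dpo_model, question, n=num_samples):
            if extract_answer(r) == gold_answer:  # keep only rationales with correct answers
                pseudo_labels.append((question, r))

    # Fine-tune on the original labels plus the filtered pseudo-labels.
    return train_sft(sft_model, labeled_data, pseudo_labels)
```

Iterating this loop lets the model bootstrap progressively better rationales from its own generations without querying a larger teacher model.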
**Experiments:**
- **Setup:** The experiments use Flan-T5 and Llama models as base models and evaluate them on the GSM8K benchmark and other math reasoning tasks.
- **Results:** The proposed method outperforms existing baselines, demonstrating superior performance on both in-domain and out-of-domain tasks.
- **Additional Findings:** The DPO step significantly improves both the quality and the diversity of the generated pseudo-labels, and integrating an external calculator further improves accuracy on questions involving arithmetic.
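On the calculator finding, one common integration, and only an assumption about how the paper wires it in, is GSM8K-style calculator annotations of the form `<<expression=result>>`: the arithmetic inside each annotation is recomputed exactly and the model's generated result is overwritten. A minimal post-hoc sketch:

```python
import re

# Hypothetical post-processing pass: recompute GSM8K-style calculator annotations
# of the form <<expression=result>> and overwrite the model's arithmetic, including
# the surface copy of the result that typically follows the closing ">>".
CALC_PATTERN = re.compile(r"<<([^=<>]+)=[^<>]*>>(?:[-\d,.]*\d)?")

def apply_calculator(rationale: str) -> str:
    def recompute(match: re.Match) -> str:
        expression = match.group(1)
        try:
            # Restricted eval for plain arithmetic; a real system should parse safely.
            value = eval(expression, {"__builtins__": {}}, {})
        except Exception:
            return match.group(0)  # leave non-arithmetic spans untouched
        return f"<<{expression}={value}>>{value}"
    return CALC_PATTERN.sub(recompute, rationale)

print(apply_calculator("She buys 3 boxes of 12 eggs, so she has <<3*12=35>>35 eggs."))
# -> "She buys 3 boxes of 12 eggs, so she has <<3*12=36>>36 eggs."
```

A decode-time variant would instead force the computed digits into the output stream as soon as the model finishes emitting the expression.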
**Conclusion:**
The paper presents a resource-efficient method that leverages self-training and DPO to enhance the reasoning capabilities of small-scale LMs. The approach not only improves performance but also reduces computational costs, making it a promising solution for mathematical reasoning tasks.