Curry-DPO: Enhancing Alignment using Curriculum Learning & Ranked Preferences


12 Mar 2024 | Pulkit Pattnaik, Rishabh Maheshwary, Kelechi Ogueji, Vikas Yadav, Sathwik Tejaswi Madhusudhan (ServiceNow)
**Abstract:** Direct Preference Optimization (DPO) is an effective technique that leverages pairwise preference data to align Large Language Models (LLMs) with human preferences. However, multiple responses of varying quality can exist for a given prompt. This paper proposes Curry-DPO, which uses these responses to create multiple preference pairs for each prompt and aligns LLMs by systematically curating and presenting those pairs in a meaningful order. Emulating curriculum learning, the preference pairs are ordered from easy to hard according to various criteria. Detailed comparisons with the standard single-pair DPO setting show that Curry-DPO consistently achieves higher performance on benchmarks such as MT-Bench, Vicuna bench, WizardLM, and the UltraFeedback test set. Notably, Curry-DPO scores 7.43 on MT-Bench with Zephyr-7B, outperforming most existing LLMs of similar parameter size. It also achieves the highest win rates on the Vicuna, WizardLM, and UltraFeedback test sets (90.7%, 87.1%, and 87.9%, respectively), with gains of up to 7.5% over standard DPO.

**Introduction:** Recent advances in instruction fine-tuning (IFT) and reinforcement learning from human feedback (RLHF) have significantly improved LLMs, and aligning them with curated human feedback is crucial for steering their response behavior. DPO fine-tunes LLMs directly on preference pairs using a supervised logistic loss derived in closed form from the RLHF objective. While DPO has shown impressive performance, it uses only a single pair of responses per prompt. This paper introduces Curry-DPO, which incorporates curriculum learning over multiple preference pairs into the DPO framework. The method shows strong improvements over standard DPO across benchmarks, achieving the best MT-Bench score of 7.43 and adjusted win rates of 87.9% on UltraFeedback and 87.1% on WizardLM.
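For reference, the supervised logistic loss mentioned above is the standard DPO objective over a dataset $\mathcal{D}$ of (prompt, chosen, rejected) triples:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Here $y_w$ is the chosen (preferred) response, $y_l$ the rejected one, $\pi_{\mathrm{ref}}$ a frozen reference model, $\sigma$ the sigmoid, and $\beta$ a temperature controlling deviation from the reference. As described in this summary, Curry-DPO keeps this objective but constructs multiple $(y_w, y_l)$ pairs per prompt and controls the order in which they are presented during training.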
**Related Work:** The paper reviews existing methods for aligning LLMs with human preferences, including RLHF, DPO, and DPO variants such as KTO and Identity Preference Optimization. It also discusses curriculum learning, which has been applied to various NLP tasks but not previously to DPO.

**Approach:** The approach samples multiple preference pairs per prompt and arranges them for curriculum learning. The paper describes how these pairs are created and curated, and the training methodology: iterative DPO in which each iteration uses the model from the previous iteration as its reference model (a minimal sketch of this loop is given below).

**Experimental Setup:** The experiments use the UltraFeedback and OpenAssistant datasets.
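To make the approach concrete, below is a minimal, illustrative sketch (not the authors' implementation) of building multiple preference pairs per prompt, ordering them from easy to hard, and training with iterative DPO where the reference model is refreshed from the previous iteration. The `logprob_fn` helper (returning summed log-probabilities of the chosen and rejected responses under a model) and the use of the reward-score gap as the difficulty criterion are assumptions made for illustration.

```python
# Illustrative sketch only -- not the authors' code. Assumes each prompt comes
# with several responses scored by a reward/quality model, and that
# logprob_fn(model, pair) returns the summed log-probabilities of the chosen
# and rejected responses under `model` (hypothetical helper).
from dataclasses import dataclass
from typing import Callable, List
import copy

import torch
import torch.nn.functional as F


@dataclass
class PreferencePair:
    prompt: str
    chosen: str       # highest-scored response for this prompt
    rejected: str     # a lower-scored response
    score_gap: float  # chosen_score - rejected_score (difficulty proxy)


def build_curriculum(scored_data, num_stages: int = 3) -> List[List[PreferencePair]]:
    """Pair every lower-scored response with the best one, then bucket the
    pairs into stages from easy (large score gap) to hard (small gap)."""
    pairs: List[PreferencePair] = []
    for prompt, responses in scored_data:
        responses = sorted(responses, key=lambda r: r["score"], reverse=True)
        best = responses[0]
        for other in responses[1:]:
            pairs.append(PreferencePair(prompt, best["text"], other["text"],
                                        best["score"] - other["score"]))
    pairs.sort(key=lambda p: p.score_gap, reverse=True)  # easy -> hard
    n = len(pairs)
    return [pairs[i * n // num_stages:(i + 1) * n // num_stages]
            for i in range(num_stages)]


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO logistic loss on sequence-level log-probabilities."""
    logits = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()


def curry_dpo_train(policy, stages, logprob_fn: Callable, lr: float = 5e-7):
    """Iterative DPO over curriculum stages: the reference model for stage t
    is a frozen copy of the policy produced by stage t-1."""
    reference = copy.deepcopy(policy).eval()
    for stage in stages:  # easy pairs first
        optimizer = torch.optim.AdamW(policy.parameters(), lr=lr)
        for pair in stage:
            pol_c, pol_r = logprob_fn(policy, pair)          # tracks gradients
            with torch.no_grad():
                ref_c, ref_r = logprob_fn(reference, pair)   # frozen reference
            loss = dpo_loss(pol_c, pol_r, ref_c, ref_r)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        reference = copy.deepcopy(policy).eval()  # refresh for next iteration
    return policy
```

In practice one would batch the pairs and run multiple epochs per stage; the single-example loop above is kept minimal for clarity.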