Curry-DPO: Enhancing Alignment using Curriculum Learning & Ranked Preferences

12 Mar 2024 | Pulkit Pattnaik, Rishabh Maheshwary, Kelechi Ogueji, Vikas Yadav, Sathwik Tejaswi Madhusudhan
Curry-DPO is a method that improves the alignment of large language models (LLMs) by combining curriculum learning with ranked preference data. Instead of using a single preference pair per prompt, it leverages multiple preference pairs for each prompt, systematically curating and ordering them to emulate a curriculum: pairs are ranked by the rating difference between the chosen and rejected responses and presented from easiest (largest gap) to hardest (smallest gap), with each training iteration using the model from the previous iteration as its reference.

This approach improves performance on benchmarks such as MT-Bench, Vicuna, WizardLM, and UltraFeedback. Curry-DPO outperforms standard DPO in both scores and win rates, reaching a score of 7.43 on MT-Bench with Zephyr-7B. Experiments show that the best results come from combining iterative training with curriculum ordering, and safety and helpfulness evaluations indicate that Curry-DPO generates safer and more helpful responses than the DPO baseline. Even with improved alignment, however, the resulting models can still generate harmful content and should be used with caution.

The approach is orthogonal to many existing alignment methods and can be extended to other preference optimization techniques. Overall, the work highlights the potential of curriculum learning in preference optimization and demonstrates the effectiveness of iterative training in improving LLM alignment.
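To make the training procedure concrete, here is a minimal Python sketch of the curriculum construction and iterative DPO loop described above, assuming each prompt comes with several rated responses (as in UltraFeedback). The names `PreferencePair`, `build_curriculum`, `dpo_loss`, and the `dpo_update` callable are illustrative placeholders rather than the authors' released code; only the standard DPO loss formula is taken as given.

```python
# Minimal sketch of curriculum-ordered, iterative DPO (Curry-DPO style).
# Assumes each prompt has multiple responses with scalar quality ratings;
# all names below are illustrative, not from the paper's implementation.
import math
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str        # higher-rated response
    rejected: str      # lower-rated response
    rating_gap: float  # chosen rating minus rejected rating (larger = "easier")


def build_curriculum(pairs: List[PreferencePair],
                     num_iterations: int) -> List[List[PreferencePair]]:
    """Sort pairs from easiest (largest rating gap) to hardest (smallest gap)
    and split them into one bucket per training iteration."""
    ordered = sorted(pairs, key=lambda p: p.rating_gap, reverse=True)
    size = math.ceil(len(ordered) / num_iterations)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair:
    -log sigma(beta * [(log pi - log pi_ref)(y_w) - (log pi - log pi_ref)(y_l)])."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


def curry_dpo(pairs: List[PreferencePair], sft_model, dpo_update,
              num_iterations: int = 3):
    """Iterative curriculum training: iteration t trains on bucket t (easiest
    first) and uses the policy from iteration t-1 as the frozen reference.
    `dpo_update(policy, reference, bucket)` stands in for one full DPO run."""
    policy, reference = sft_model, sft_model
    for bucket in build_curriculum(pairs, num_iterations):
        policy = dpo_update(policy, reference, bucket)
        reference = policy  # next iteration references the latest policy
    return policy
```

In this sketch a larger rating gap marks an "easier" pair, so the first iteration trains on the most clearly separated pairs, and the reference model is refreshed after every iteration, mirroring the iterative curriculum setup summarized above.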