Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization


10 Jul 2024 | Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jiawei Chen, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He
This paper introduces Dr. DPO, a novel framework that enhances the robustness of Direct Preference Optimization (DPO) against both pointwise and pairwise noise in training data. The study addresses the challenge of noise in training datasets for DPO, a method for aligning Large Language Models (LLMs) with human preferences. The authors categorize noise into pointwise noise, which includes low-quality data points, and pairwise noise, which encompasses erroneous data pair associations that affect preference rankings. By leveraging Distributionally Robust Optimization (DRO), they enhance DPO's resilience to these types of noise. Theoretical insights reveal that DPO inherently embeds DRO principles, conferring robustness to pointwise noise, with the regularization coefficient β playing a critical role in its noise resistance. Extending this framework, they introduce Dr. DPO, which integrates pairwise robustness by optimizing against worst-case pairwise scenarios. The novel hyperparameter β' in Dr. DPO allows for fine-tuned control over data pair reliability, providing a strategic balance between exploration and exploitation in noisy training environments. Empirical evaluations demonstrate that Dr. DPO substantially improves the quality of generated text and response accuracy in preference datasets, showcasing enhanced performance in both noisy and noise-free settings. The code is available at https://github.com/junkangwu/Dr_DPO.
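To make the mechanism concrete, the following is a minimal PyTorch sketch (not the authors' released implementation; see the linked repository for that) of the standard per-pair DPO loss and one plausible DRO-style aggregation over pairs controlled by β'. In this reading of the abstract, pairs with very high loss, which are the ones most likely to carry flipped preference labels, contribute less to the batch objective; the function and variable names below are illustrative assumptions, not names taken from the paper or its code.

```python
# Illustrative sketch of DPO with a DRO-style pairwise aggregation (assumed form,
# consistent with the abstract's description; consult the paper for the exact objective).
import math
import torch
import torch.nn.functional as F


def dpo_per_pair_loss(policy_logps_w, policy_logps_l,
                      ref_logps_w, ref_logps_l, beta=0.1):
    """Standard per-pair DPO loss: -log sigmoid(beta * reward margin).

    The inputs are summed token log-probabilities of the chosen (y_w) and
    rejected (y_l) responses under the policy and the frozen reference model.
    """
    margin = (policy_logps_w - ref_logps_w) - (policy_logps_l - ref_logps_l)
    return -F.logsigmoid(beta * margin)


def dro_style_batch_loss(per_pair_loss, beta_prime=1.0):
    """DRO-style aggregation over preference pairs (a sketch, not the paper's code).

    Replaces the uniform mean with a soft aggregation controlled by beta_prime:
    as beta_prime grows large this recovers the ordinary mean (plain DPO), while
    smaller beta_prime discounts high-loss pairs, i.e. pairs that look like
    pairwise (label-flip) noise, trading exploitation of each pair against
    distrust of unreliable ones.
    """
    n = per_pair_loss.numel()
    # -beta' * log( mean( exp(-loss / beta') ) ): a soft-min over per-pair losses.
    return -beta_prime * (torch.logsumexp(-per_pair_loss / beta_prime, dim=0) - math.log(n))


if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy batch of 4 pairs with fake summed log-probabilities.
    pl_w, pl_l = torch.randn(4), torch.randn(4)
    rf_w, rf_l = torch.randn(4), torch.randn(4)
    losses = dpo_per_pair_loss(pl_w, pl_l, rf_w, rf_l, beta=0.1)
    print("per-pair DPO losses:", losses)
    print("DRO-style batch loss:", dro_style_batch_loss(losses, beta_prime=1.0))
```

As a usage note, sweeping the assumed `beta_prime` on a deliberately label-flipped subset of the preference data is a quick way to see the intended behavior: large values reproduce vanilla DPO's sensitivity to flipped pairs, while moderate values damp their influence on the batch gradient.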