Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

21 Apr 2024 | Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu
This paper investigates whether Direct Preference Optimization (DPO) is superior to Proximal Policy Optimization (PPO) for aligning large language models (LLMs) with human preferences. The study compares the two methods through both theoretical analysis and empirical experiments across a range of benchmarks.

The results show that while DPO is a reward-free method that avoids training a reward model, it has fundamental limitations, particularly in handling out-of-distribution (OOD) data and distribution shifts between model outputs and preference data. In contrast, PPO, which relies on an explicit reward model, demonstrates superior performance on both academic benchmarks and challenging code generation tasks. The paper identifies key factors that enhance PPO's performance: advantage normalization, large-batch-size training, and exponential moving average (EMA) updates for the reference model.

Through extensive experiments on dialogue, code generation, and other tasks, the study concludes that PPO consistently outperforms DPO and achieves state-of-the-art results in code competitions. The findings suggest that while DPO can be effective in some scenarios, PPO is more robust for aligning LLMs with human preferences, especially on complex tasks. The paper also highlights the importance of mitigating distribution shift and improving the quality of preference data to strengthen both methods.
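As a rough illustration of the methods being compared, the sketch below shows the standard DPO objective (which optimizes the policy directly on preference pairs, with no separate reward model) alongside two of the PPO ingredients the paper credits: advantage normalization and an exponential moving average update of the reference model. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation; the function names, tensor shapes, and hyperparameters (beta, the EMA decay) are illustrative choices.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Standard DPO objective: push the implicit reward beta * log(pi / pi_ref)
    # of the preferred response above that of the rejected response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def normalize_advantages(advantages, eps=1e-8):
    # Advantage normalization, one of the PPO tricks the paper highlights:
    # rescale advantage estimates to zero mean and unit variance per batch.
    return (advantages - advantages.mean()) / (advantages.std() + eps)

@torch.no_grad()
def ema_update_reference(ref_model, policy_model, decay=0.995):
    # Exponential moving average update of the reference (KL-anchor) model,
    # so the KL penalty tracks a slowly moving copy of the current policy.
    for ref_p, p in zip(ref_model.parameters(), policy_model.parameters()):
        ref_p.mul_(decay).add_(p, alpha=1.0 - decay)
```

In a typical PPO loop for RLHF, normalize_advantages would be applied to the advantage estimates before computing the clipped surrogate loss, and ema_update_reference would be called periodically after optimizer steps.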