This paper explores the effectiveness of Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO) in aligning large language models (LLMs) with human preferences. DPO, a reward-free method, and PPO, a reward-based method, are widely used in reinforcement learning from human feedback (RLHF). The study aims to answer two key questions: whether DPO is truly superior to PPO, and why PPO has been reported to perform poorly on academic benchmarks.
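For context (this equation is not part of the summary itself but is the standard DPO objective from Rafailov et al., 2023), DPO is "reward-free" in the sense that it trains the policy directly on preference pairs, folding the implicit reward into a log-ratio against a reference model:

```latex
% Standard DPO objective (Rafailov et al., 2023): beta controls the strength of
% the implicit KL penalty; y_w / y_l are the preferred / dispreferred responses.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[
    \log\sigma\!\left(
      \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
      -\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
    \right)
  \right]
```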
The paper begins with a theoretical and empirical analysis of DPO, revealing its fundamental limitations, such as the potential for biased solutions and sensitivity to distribution shift. Empirical results show that DPO's performance is significantly affected by the distribution shift between the model outputs and the preference dataset. To enhance PPO's performance, the paper identifies critical factors, including advantage normalization, large batch size, and exponential moving average (EMA) updates for the reference model.
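As a rough illustration only (a minimal sketch, not the paper's implementation; the function names, tensor shapes, and EMA coefficient are assumptions), the two model-side factors can be expressed as follows: advantages are whitened within each large batch before the PPO update, and the reference model used for the KL penalty is moved slowly toward the current policy.

```python
import torch


def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Whiten advantages across the (large) batch before computing the PPO loss."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)


@torch.no_grad()
def ema_update_reference(ref_model: torch.nn.Module,
                         policy_model: torch.nn.Module,
                         alpha: float = 0.995) -> None:
    """Exponential moving average update of the reference model:
    ref <- alpha * ref + (1 - alpha) * policy. (alpha = 0.995 is an assumed value.)"""
    for ref_param, policy_param in zip(ref_model.parameters(), policy_model.parameters()):
        ref_param.mul_(alpha).add_(policy_param, alpha=1.0 - alpha)
```

In a full PPO training loop, the first helper would be applied once per batch of rollouts and the second once per policy update, so the KL penalty is computed against a slowly moving reference rather than a frozen one.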
Extensive experiments across various RLHF testbeds, including dialogue and code generation tasks, demonstrate that PPO consistently outperforms DPO. Notably, PPO achieves state-of-the-art results on challenging code-competition tasks, outperforming other alignment methods and even surpassing ChatGPT and Claude, which themselves use PPO.
The paper concludes by highlighting the importance of addressing distribution shift and by providing practical guidelines for improving PPO's performance. It also suggests future work on effectively training robust reward models and further exploring the limitations of DPO in complex tasks.