Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

13 Jun 2024 | Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi
This paper explores the impact of different aspects of preference-based learning on the performance of modern language models (LMs). The authors identify four core components: preference data, learning algorithm, reward model, and policy training prompts. They systematically investigate the effects of varying these components on downstream model performance and propose a recipe for effective preference-based learning. Key findings include:

1. **Preference Data**: Synthetic, diverse data annotated with per-aspect preferences performs best, improving instruction following and truthfulness by up to 8%.
2. **Learning Algorithm**: PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains (see the sketch of the DPO objective after this summary).
3. **Reward Model**: Increasing the size and scale of the reward model improves performance, but the impact is marginal across most evaluations.
4. **Policy Training Prompts**: Targeted prompts that match the test setting can further improve performance in domain-specific settings, but have limited effects on overall performance.

The authors recommend using synthetic preference datasets and training with PPO using a large reward model, along with targeted prompts for specific tasks. They also release the code and models used in their experiments to facilitate further research.
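To make the DPO side of the comparison concrete, below is a minimal sketch of the standard DPO loss on a batch of preference pairs: it pushes the policy to increase the log-probability margin of the preferred response over the dispreferred one, relative to a frozen reference model. This is a generic illustration, not the authors' released code; the function name, argument names, and the `beta=0.1` default are assumptions for the example.

```python
# Minimal sketch of the DPO objective (illustrative, not the paper's code).
# Assumes you already have summed per-sequence log-probabilities of the chosen
# and rejected responses under the policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs (all tensors shaped [batch])."""
    # Log-ratios of policy vs. reference for preferred and dispreferred responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected, scaled by beta.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-probabilities.
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(loss.item())
```

PPO, by contrast, does not train directly on preference pairs: it first fits a reward model to the preferences and then optimizes the policy against that learned reward (typically with a KL penalty toward the reference model), which is why reward-model size and policy-training prompts appear as separate components in the paper's analysis.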