Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

13 Jun 2024 | Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi
This paper explores the impact of different aspects of preference-based learning on the performance of modern language models (LMs). The authors identify four core components: preference data, learning algorithm, reward model, and policy training prompts. They systematically investigate the effects of varying these components on downstream model performance and propose a recipe for effective preference-based learning. Key findings include:

1. **Preference Data**: Synthetic, diverse data annotated with per-aspect preferences performs best, improving instruction following and truthfulness by up to 8%.
2. **Learning Algorithm**: PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains (see the sketch of the DPO objective after this summary).
3. **Reward Model**: Increasing the size and scale of the reward model improves performance, but the impact is marginal across most evaluations.
4. **Policy Training Prompts**: Targeted prompts that match the test setting can further improve performance in domain-specific settings, but have limited effects on overall performance.

The authors recommend using synthetic preference datasets and training with PPO using a large reward model, along with targeted prompts for specific tasks. They also release the code and models used in their experiments to facilitate further research.
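To make the DPO side of the comparison concrete, below is a minimal sketch of the standard DPO loss on a batch of preference pairs: it pushes the policy to increase the log-probability margin of the preferred response over the dispreferred one, relative to a frozen reference model. This is a generic illustration, not the authors' released code; the function name, argument names, and the `beta=0.1` default are assumptions for the example.

```python
# Minimal sketch of the DPO objective (illustrative, not the paper's code).
# Assumes you already have summed per-sequence log-probabilities of the chosen
# and rejected responses under the policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs (all tensors shaped [batch])."""
    # Log-ratios of policy vs. reference for preferred and dispreferred responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected, scaled by beta.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-probabilities.
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(loss.item())
```

PPO, by contrast, does not train directly on preference pairs: it first fits a reward model to the preferences and then optimizes the policy against that learned reward (typically with a KL penalty toward the reference model), which is why reward-model size and policy-training prompts appear as separate components in the paper's analysis.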