29 Jul 2024 | Rafael Rafaïlov*, Archit Sharma*, Eric Mitchell*, Stefano Ermon†‡, Christopher D. Manning†, Chelsea Finn†
The paper introduces Direct Preference Optimization (DPO), a method for training large language models (LMs) to align with human preferences without reinforcement learning (RL). DPO directly optimizes a policy to satisfy preferences using a simple binary cross-entropy objective, eliminating the complexity and instability associated with reinforcement learning from human feedback (RLHF). The key insight is an analytical mapping from reward functions to optimal policies, which allows the optimal policy to be expressed in closed form and the preference-learning problem to be recast as a loss on the policy itself. This avoids explicit reward modeling and RL, making DPO simpler and more computationally efficient. Experiments show that DPO matches or exceeds existing methods, including PPO-based RLHF, on sentiment modulation, summarization, and dialogue, using LMs with up to 6 billion parameters. DPO is also robust to changes in sampling temperature and generalizes well to new input distributions. The paper provides theoretical analysis and discusses limitations and future directions, highlighting DPO's potential for training more controllable and aligned AI systems.
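
To make the binary cross-entropy objective concrete, below is a minimal PyTorch sketch of the DPO loss. It assumes summed per-sequence log-probabilities for each (prompt, chosen, rejected) triple have already been computed under both the trained policy and the frozen reference model; the function name, argument names, and the default beta value are illustrative, not taken from the paper's codebase.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Binary cross-entropy form of the DPO objective (a sketch).

    Each argument is a tensor of shape (batch,) holding the summed
    log-probability of a full response under the policy or the frozen
    reference model. `beta` controls the strength of the implicit KL
    constraint that keeps the policy close to the reference.
    """
    # Implicit rewards: beta times the log-ratio between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # -log sigmoid(margin): increase the implicit reward of the preferred
    # response relative to the dispreferred one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative usage with random log-probabilities for a batch of 4 pairs.
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b),
                    torch.randn(b), torch.randn(b))
    print(loss.item())
```

Because the loss is an ordinary differentiable function of the policy's log-probabilities, it can be minimized with standard gradient descent, with no reward model, sampling loop, or PPO-style value estimation in the training pipeline.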