2023 | Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
Direct Preference Optimization (DPO) is a new method for training language models from human preferences without reinforcement learning. Unlike existing approaches that first fit a reward model and then use reinforcement learning to optimize a policy against it, DPO directly optimizes the policy to satisfy human preferences with a simple classification objective. This eliminates the need to sample from the language model during fine-tuning or to perform significant hyperparameter tuning. DPO is stable, performant, and computationally lightweight, and fine-tunes language models to align with human preferences as well as or better than existing methods; notably, it exceeds PPO-based RLHF at controlling the sentiment of generations and matches or improves response quality in summarization and single-turn dialogue, while being substantially simpler to implement and train. The method starts from a theoretical preference model, such as the Bradley-Terry model, which measures how well a given reward function aligns with empirical preference data, and applies a change of variables to express the preference loss directly as a function of the policy.
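Concretely, under the Bradley-Terry model the probability that response y_w is preferred to y_l for a prompt x is sigma(r(x, y_w) - r(x, y_l)), where sigma is the logistic function. Substituting the implicit reward beta * log(pi_theta(y|x) / pi_ref(y|x)) given by the change of variables yields the DPO objective, written here in LaTeX for reference (pi_ref is a fixed reference policy, typically the supervised fine-tuned model, and beta a scaling hyperparameter):

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right) \right],

where y_w and y_l denote the preferred and dispreferred responses in a preference pair.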
Given a dataset of human preferences over model responses, DPO optimizes the policy with a simple binary cross-entropy objective, producing the optimal policy for an implicit reward function fit to the preference data. DPO therefore bypasses both fitting an explicit reward model and performing RL, learning the policy with a single maximum-likelihood objective; it is a simple, RL-free algorithm that directly optimizes a language model to adhere to human preferences. Our experiments show that DPO is at least as effective as existing methods, including PPO-based RLHF, for learning from preferences on tasks such as sentiment modulation, summarization, and dialogue, using language models with up to 6B parameters.
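A minimal PyTorch sketch of this binary cross-entropy objective, assuming the summed per-token log-probabilities of each preferred ("chosen") and dispreferred ("rejected") response have already been computed under the trainable policy and the frozen reference model (function and argument names are illustrative, not taken from any released implementation):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss as a binary cross-entropy over preference pairs.

    Each argument is a 1-D tensor with one summed log-probability per
    sequence: the chosen and rejected responses scored by the trainable
    policy and by the frozen reference model.
    """
    # Implicit rewards are beta times the policy/reference log-ratios.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # Margin between the implicit rewards of the chosen and rejected responses.
    logits = beta * (chosen_logratios - rejected_logratios)

    # -log sigmoid(margin): binary cross-entropy with the chosen response as the positive label.
    return -F.logsigmoid(logits).mean()

In practice, each per-sequence log-probability is obtained by summing the token-level log-probabilities over the response tokens, and the loss is averaged over a batch of preference pairs.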