29 May 2024 | Adam Fisch*, Jacob Eisenstein*, Vicky Zayats*, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, Jonathan Berant*
This paper introduces a method for robust preference optimization through reward model distillation. The authors analyze the limitations of Direct Preference Optimization (DPO), which can lead to degenerate policies due to overconfidence in the preference data. They propose instead distilling an explicitly trained reward model into the policy and, for robustness, optimizing against a family of plausible reward models, which yields better robustness to distribution shift in the preference annotations while preserving the simplicity of DPO.
The key idea is to train the language model to produce probabilities that match the distribution induced by a reward model trained on preference data. To account for uncertainty in the reward model, the authors optimize against a family of reward models that are likely to include at least one reasonable proxy for the preference distribution. This approach leads to improved robustness to distribution shifts in preference annotations.
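The matching idea above can be sketched as a per-example loss. This is an illustrative sketch, not the paper's exact formulation or API: the policy's implicit reward difference (scaled log-probability ratios against a reference model) is regressed onto the reward difference predicted by a separately trained reward model. All names and the `beta` default are assumptions.

```python
def distillation_loss(policy_logratio_w, policy_logratio_l,
                      reward_w, reward_l, beta=0.1):
    """Illustrative sketch of reward model distillation.

    policy_logratio_w / policy_logratio_l: log pi(y|x) - log pi_ref(y|x)
    for the preferred (w) and rejected (l) responses.
    reward_w / reward_l: scores from a trained reward model.

    The policy is trained so that its implicit reward difference
    matches the reward model's predicted difference.
    """
    implicit_diff = beta * (policy_logratio_w - policy_logratio_l)
    target_diff = reward_w - reward_l
    return (implicit_diff - target_diff) ** 2
```

When the policy's scaled log-ratio gap exactly matches the reward gap, the loss is zero; unlike DPO, the target comes from a reward model rather than directly from binary preference labels, which tempers overconfidence on noisy annotations.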
The authors also introduce a pessimistic extension to their approach, which aims to maximize the worst-case improvement of the model across a plausible family of reward models. This strategy aligns with conservative offline reinforcement learning techniques. They show that this pessimistic objective can be equivalently expressed and optimized by adding a simple additional KL-divergence regularization to the original distillation objective.
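A minimal sketch of that equivalence, with hypothetical names and an illustrative weight `alpha`: per the summary above, the worst-case (pessimistic) objective over a family of reward models reduces to the plain distillation loss plus an extra KL penalty pulling the policy toward the reference model.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pessimistic_objective(distill_loss, policy_probs, ref_probs, alpha=0.5):
    """Sketch of the pessimistic objective: the original distillation
    loss augmented with a KL(pi || pi_ref) regularizer. `alpha` is an
    illustrative trade-off weight, not a value from the paper."""
    return distill_loss + alpha * kl_divergence(policy_probs, ref_probs)
```

The added KL term is what makes the objective conservative: uncertainty over which reward model is correct translates into staying closer to the reference policy.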
Empirically, the authors find that reward model distillation, and especially its pessimistic variant, performs comparably to prior direct preference optimization methods (DPO and the Identity Preference Optimization, IPO, framework) when the preference datasets are unbiased, but significantly outperforms them when the preference datasets are biased.
Theoretical analysis shows that the distillation objective is equivalent to the traditional RLHF objective optimized with online reinforcement learning. The authors also show that the pessimistic objective can be expressed as a constrained optimization problem, which can be solved with a loss of the same form as the distillation loss.
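The equivalence claim rests on the well-known closed form of the KL-regularized RLHF objective; a sketch in standard notation (standard results, not reproduced from the paper):

```latex
\max_{\pi} \;\; \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
\;-\; \beta \, \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \right),
\qquad
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x) \, \exp\!\left( r(x, y) / \beta \right).
```

Training the language model so that its output distribution matches the target distribution induced by the reward model is therefore equivalent to maximizing the RLHF objective itself, without running reinforcement learning.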
The results demonstrate that the proposed method achieves improved robustness to variations in preference dataset quality while maintaining the simplicity of the DPO framework. The authors conclude that explicit reward modeling remains a powerful vehicle for introducing regularization into post-training.