29 May 2024 | Adam Fisch*, Jacob Eisenstein*, Vicky Zayats*, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, Jonathan Berant*
The paper addresses robust preference optimization in language model (LM) post-training, focusing on the limitations of Direct Preference Optimization (DPO). DPO optimizes the policy directly on preference data without training a reward model, but it often produces overconfident, degenerate policies because each preference pair typically carries only one or a few annotations. To mitigate this, the authors propose *distillation*: the LM is trained to produce preference probabilities that match the distribution induced by a reward model trained on the preference data. They also introduce a *pessimistic extension* that handles uncertainty in the reward model by optimizing against a family of reward models, so that the policy performs well even under the worst-case member of that family. Empirical results show that this approach improves robustness to distribution shifts in the preference annotations while retaining the simplicity of DPO, and the theoretical analysis explains DPO's degenerative tendencies and the advantages of the proposed methods.
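To make the high-level ideas concrete, here is a minimal PyTorch-style sketch of how a distillation-style preference loss and a worst-case (pessimistic) variant could look, next to the standard DPO loss. This is not the paper's implementation: the function names, the cross-entropy form of the distillation target, and the finite reward-model ensemble standing in for the paper's reward-model family are all illustrative assumptions.

```python
# Minimal sketch (not the paper's reference implementation).
# Assumes precomputed scalar rewards r(x, y) from trained reward model(s) and
# policy / reference log-probabilities for each response; all names are illustrative.
import torch
import torch.nn.functional as F


def implicit_reward_margin(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO's implicit reward margin:
    beta * [(log pi - log pi_ref)(y_w) - (log pi - log pi_ref)(y_l)]."""
    return beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO: each pair is a hard label (y_w preferred with probability 1)."""
    margin = implicit_reward_margin(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta)
    return -F.logsigmoid(margin).mean()


def distillation_loss(reward_w, reward_l, policy_logp_w, policy_logp_l,
                      ref_logp_w, ref_logp_l, beta=0.1):
    """Distillation-style loss: match the soft preference probability induced by a
    trained reward model, sigma(r(x, y_w) - r(x, y_l)), instead of a hard 0/1 label."""
    target_p = torch.sigmoid(reward_w - reward_l)   # soft target from the reward model
    margin = implicit_reward_margin(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta)
    model_logp = F.logsigmoid(margin)               # log P_policy(y_w preferred)
    model_logp_rev = F.logsigmoid(-margin)          # log P_policy(y_l preferred)
    # Cross-entropy between the reward-induced and policy-induced preference distributions.
    return -(target_p * model_logp + (1.0 - target_p) * model_logp_rev).mean()


def pessimistic_distillation_loss(reward_w_ensemble, reward_l_ensemble,
                                  policy_logp_w, policy_logp_l,
                                  ref_logp_w, ref_logp_l, beta=0.1):
    """Pessimistic variant (illustrative): take the worst-case distillation loss over a
    small ensemble of reward models standing in for a family of plausible reward models."""
    losses = torch.stack([
        distillation_loss(rw, rl, policy_logp_w, policy_logp_l,
                          ref_logp_w, ref_logp_l, beta)
        for rw, rl in zip(reward_w_ensemble, reward_l_ensemble)
    ])
    return losses.max()
```

The contrast with DPO is the target: DPO pushes the implicit reward margin toward a hard 0/1 label per pair, while the distillation loss pushes it toward the softer probability a trained reward model assigns, and the pessimistic variant optimizes against whichever reward model in the family the policy currently handles worst.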