29 May 2024 | Angelica Chen, Sadhika Malladi, Lily H. Zhang, Xinyi Chen, Qiuyi Zhang, Rajesh Ranganath, Kyunghyun Cho
Preference learning algorithms, such as RLHF and DPO, are widely used to align large language models (LLMs) with human preferences. However, this study reveals that these algorithms do not effectively learn preference rankings: most state-of-the-art preference-tuned models achieve ranking accuracy below 60% on common preference datasets, indicating a significant alignment gap between observed and idealized ranking accuracies. The idealized ranking accuracy, derived from perfect optimization of the DPO or RLHF objective, is much higher than what is achieved in practice. This discrepancy is attributed to the DPO objective's inability to correct even mild ranking errors in the reference model. The study also demonstrates that ranking accuracy strongly correlates with win rate when the model is close to the reference model, but becomes anti-correlated when the model moves away. These findings highlight the limitations of current preference learning algorithms and the need for more refined methods to improve alignment and ranking accuracy.
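To make the central metric concrete, here is a minimal sketch of how ranking accuracy over a preference dataset could be computed, assuming it is measured with the DPO-style implicit reward (the log-probability ratio between the policy and the reference model). The function name and input format are illustrative, not taken from the paper's code.

```python
import torch

def ranking_accuracy(policy_logps_chosen, policy_logps_rejected,
                     ref_logps_chosen, ref_logps_rejected):
    """Fraction of preference pairs where the policy's implicit reward
    ranks the chosen response above the rejected one.

    All inputs are 1-D tensors of summed per-sequence log-probabilities,
    one entry per (prompt, chosen, rejected) triple.
    """
    # DPO's implicit reward (up to the constant beta) is the log-probability
    # ratio between the policy and the reference model.
    reward_chosen = policy_logps_chosen - ref_logps_chosen
    reward_rejected = policy_logps_rejected - ref_logps_rejected
    # A pair is ranked correctly when the chosen response gets the higher reward.
    correct = (reward_chosen > reward_rejected).float()
    return correct.mean().item()

# Illustrative usage with random numbers standing in for model log-probs.
if __name__ == "__main__":
    n = 1000
    acc = ranking_accuracy(torch.randn(n), torch.randn(n),
                           torch.randn(n), torch.randn(n))
    print(f"ranking accuracy: {acc:.3f}")  # ~0.5 for random scores
```

Under this reading, the paper's sub-60% figures mean that even after preference tuning, the policy's implicit reward fails to order a large fraction of (chosen, rejected) pairs the way the human annotations do.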