29 May 2024 | Angelica Chen, Sadhika Malladi, Lily H. Zhang, Xinyi Chen, Qiuyi Zhang, Rajesh Ranganath, Kyunghyun Cho
Preference learning algorithms, such as RLHF and DPO, are widely used to align large language models (LLMs) with human preferences. However, this study reveals that these algorithms do not effectively learn preference rankings: most state-of-the-art preference-tuned models achieve ranking accuracy below 60% on common preference datasets, indicating a significant alignment gap between observed and idealized ranking accuracies. The idealized ranking accuracy, derived from perfect optimization of the DPO or RLHF objective, is much higher than what is achieved in practice. This discrepancy is attributed to the DPO objective's inability to correct even mild ranking errors in the reference model. The study also demonstrates that ranking accuracy strongly correlates with win rate when the model is close to the reference model, but becomes anti-correlated when the model moves away. These findings highlight the limitations of current preference learning algorithms and the need for more refined methods to improve alignment and ranking accuracy.
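To make the central metric concrete, here is a minimal sketch of how ranking accuracy over a preference dataset could be computed, assuming it is measured with the DPO-style implicit reward (the log-probability ratio between the policy and the reference model). The function name and input format are illustrative, not taken from the paper's code.

```python
import torch

def ranking_accuracy(policy_logps_chosen, policy_logps_rejected,
                     ref_logps_chosen, ref_logps_rejected):
    """Fraction of preference pairs where the policy's implicit reward
    ranks the chosen response above the rejected one.

    All inputs are 1-D tensors of summed per-sequence log-probabilities,
    one entry per (prompt, chosen, rejected) triple.
    """
    # DPO's implicit reward (up to the constant beta) is the log-probability
    # ratio between the policy and the reference model.
    reward_chosen = policy_logps_chosen - ref_logps_chosen
    reward_rejected = policy_logps_rejected - ref_logps_rejected
    # A pair is ranked correctly when the chosen response gets the higher reward.
    correct = (reward_chosen > reward_rejected).float()
    return correct.mean().item()

# Illustrative usage with random numbers standing in for model log-probs.
if __name__ == "__main__":
    n = 1000
    acc = ranking_accuracy(torch.randn(n), torch.randn(n),
                           torch.randn(n), torch.randn(n))
    print(f"ranking accuracy: {acc:.3f}")  # ~0.5 for random scores
```

Under this reading, the paper's sub-60% figures mean that even after preference tuning, the policy's implicit reward fails to order a large fraction of (chosen, rejected) pairs the way the human annotations do.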