LiPO: Listwise Preference Optimization through Learning-to-Rank


22 May 2024 | Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, Peter J. Liu, Xuanhui Wang
This paper introduces LiPO, a framework for aligning language models (LMs) with human preferences by treating preference optimization as a listwise ranking problem. The key idea is to learn directly from ranked lists of responses, which can be more effective than pairwise methods for LM alignment. The LiPO framework is connected to the Learning-to-Rank (LTR) literature, where existing preference optimization methods can be mapped to specific ranking objectives. The paper proposes LiPO-λ, a method that leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more principled manner. LiPO-λ outperforms DPO and SLiC on several preference alignment tasks with both curated and real rankwise preference data. The paper also examines a range of ranking objectives, including pairwise and listwise losses, and shows that LiPO-λ achieves competitive performance across multiple tasks. The framework is evaluated on the Reddit TL;DR and AnthropicHH datasets, where LiPO-λ outperforms existing methods on both. The paper further discusses the limitations of existing methods and highlights the importance of considering label values and listwise permutation information in preference optimization. The results demonstrate that LiPO-λ is a promising approach for aligning LMs with human preferences.
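To make the listwise weighting idea concrete, the sketch below illustrates a LambdaRank-style weighted pairwise logistic loss over one ranked list of responses, in the spirit of LiPO-λ. It assumes precomputed per-response scores s_i = β·log(π_θ(y_i|x)/π_ref(y_i|x)) and graded preference labels; the function name, the DCG-style gain/discount choices, and all variable names are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of a lambda-weighted listwise preference loss (assumed form,
# not the authors' code). Each pair (i, j) with label_i > label_j contributes a
# logistic loss on the score difference, scaled by a DCG-style lambda weight.
import numpy as np

def lambda_weighted_pairwise_loss(scores, labels):
    """scores: (K,) policy-vs-reference log-ratios for K responses to one prompt.
    labels: (K,) graded preference labels (higher means more preferred)."""
    K = len(scores)
    # Rank positions (1-based) induced by the current scores, used for discounts.
    order = np.argsort(-scores)
    ranks = np.empty(K, dtype=int)
    ranks[order] = np.arange(1, K + 1)

    gains = 2.0 ** labels - 1.0             # DCG gain per response
    discounts = 1.0 / np.log2(1.0 + ranks)  # DCG discount at the current rank

    loss = 0.0
    for i in range(K):
        for j in range(K):
            if labels[i] <= labels[j]:
                continue  # only count pairs where i is strictly preferred over j
            # Lambda weight: change in DCG if responses i and j swapped positions.
            delta = abs(gains[i] - gains[j]) * abs(discounts[i] - discounts[j])
            # Pairwise logistic loss on the score difference.
            loss += delta * np.log1p(np.exp(-(scores[i] - scores[j])))
    return loss

# Example: four responses to one prompt with graded labels.
scores = np.array([1.2, 0.3, -0.5, 0.9])
labels = np.array([3.0, 1.0, 0.0, 2.0])
print(lambda_weighted_pairwise_loss(scores, labels))
```

Under this view, plain pairwise methods such as DPO correspond to setting the lambda weight to a constant for every preferred/dispreferred pair, whereas the listwise weighting upweights pairs whose misordering would most hurt the ranking of the whole list.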