LiPO: Listwise Preference Optimization through Learning-to-Rank


22 May 2024 | Tianqi Liu†*, Zhen Qin†*, Junru Wu†, Jiaming Shen†, Misha Khalman†, Rishabh Joshi†, Yao Zhao†, Mohammad Saleh†, Simon Baumgartner†, Jialu Liu‡, Peter J. Liu†, Xuanhui Wang†
**Affiliations:** Google DeepMind (†), Google (‡)
**Contact:** {tianqiliu, zhenqin}@google.com

**Abstract:** Aligning language models (LMs) with curated human feedback is crucial for controlling their behavior in real-world applications. Recent policy optimization methods, such as DPO and SLiC, offer promising alternatives to traditional Reinforcement Learning from Human Feedback (RLHF). Human feedback often comes in the form of ranked lists over multiple responses, which can be more efficient to collect than pairwise comparisons. This work formulates LM alignment as a *listwise* ranking problem and introduces the LiPO framework, in which the policy learns more effectively from a ranked list of plausible responses. The formulation draws a direct connection to Learning-to-Rank (LTR), where existing preference optimization methods can be mapped to well-studied ranking objectives. We examine ranking objectives that have not been well studied for LM alignment, highlighting LiPO-λ, a method that leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more advanced manner. LiPO-λ outperforms DPO variants and SLiC on several preference alignment tasks with both curated and real listwise preference data.

**Introduction:** Recent large language models (LMs) have achieved impressive performance on diverse tasks, but aligning them with human preferences remains challenging. RLHF is complex and resource-intensive, which has motivated alternatives such as DPO and SLiC that optimize pairwise ranking losses directly on human preference data. However, these methods ignore listwise permutation information and label values, which can limit their effectiveness. This work proposes LiPO, a framework that generalizes recent preference optimization methods and enables the examination of further alternatives through the lens of LTR. We provide a comprehensive study of ranking objectives, showing that LiPO-λ, a method based on LambdaLoss, outperforms existing methods across a range of tasks. The pairwise-ranking view of DPO and SLiC is illustrated in the sketch below.
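To make the pairwise-ranking view concrete, here is a minimal sketch of the published DPO logistic loss and a SLiC-style hinge loss, both written as ranking losses on a single (preferred, dispreferred) pair. This is an illustration, not code from the paper: the function name, the use of reference-normalized scores for the SLiC variant, and the `beta`/`margin` values are our own choices.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_losses(logp_w, logp_l, ref_logp_w, ref_logp_l,
                               beta=0.1, margin=1.0):
    """Pairwise-ranking view of DPO and a SLiC-style hinge on one pair.

    logp_*     : summed log-prob of the response under the policy.
    ref_logp_* : summed log-prob under the frozen reference model.
    """
    # Implicit "reward" of each response: scaled log-ratio vs. the reference.
    s_w = beta * (logp_w - ref_logp_w)   # preferred response
    s_l = beta * (logp_l - ref_logp_l)   # dispreferred response

    # DPO: pairwise logistic (RankNet-style) loss on the score difference.
    dpo_loss = -F.logsigmoid(s_w - s_l)

    # SLiC-style hinge loss, written here on the same normalized scores.
    slic_loss = torch.clamp(margin - (s_w - s_l), min=0.0)

    return dpo_loss, slic_loss

# Toy usage with made-up log-probabilities.
dpo, slic = pairwise_preference_losses(
    torch.tensor(-12.0), torch.tensor(-15.0),
    torch.tensor(-13.0), torch.tensor(-14.5))
print(dpo.item(), slic.item())
```

Both losses depend only on the score difference of one pair, which is the sense in which DPO and SLiC are pairwise ranking objectives: where a response sits relative to the rest of a ranked list, and how large the label gap is, never enters the loss.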
**LiPO Framework:** The LiPO framework formulates LM alignment as a listwise ranking problem in which the policy learns to rank a list of responses. This approach leverages listwise preference data, which carries more information than pairwise data. We map existing methods to specific ranking objectives and highlight the limitations of current approaches. LiPO-λ, a specific instantiation, uses LambdaLoss, which takes label values and dynamic permutations into account and leads to better performance; a minimal sketch of this listwise objective is given after the conclusion below.

**Experiments:** We evaluate LiPO-λ on the Reddit TL;DR and AnthropicHH datasets, showing superior performance compared to DPO and SLiC. Ablation studies demonstrate the benefits of listwise data, the choice of Lambda weights, and model size. LiPO-λ scales well to larger policy models and performs well in human evaluation.

**Conclusion:** LiPO provides a comprehensive framework for LM alignment with listwise preference data, leveraging advanced ranking objectives. LiPO-λ, a specific instantiation based on LambdaLoss, consistently outperforms pairwise baselines such as DPO and SLiC.
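To make the listwise objective concrete, below is a minimal sketch of a LambdaLoss-style listwise loss over one ranked list of responses, in the spirit of LiPO-λ. It is a simplified illustration rather than the authors' implementation: the scores reuse the policy-vs-reference log-ratios from the pairwise sketch above, the lambda weights use the standard DCG gain/discount form (maxDCG normalization omitted for brevity), and names such as `lipo_lambda_loss` are hypothetical.

```python
import torch
import torch.nn.functional as F

def lipo_lambda_loss(policy_logps, ref_logps, labels, beta=0.1):
    """LambdaLoss-style listwise preference loss over one ranked list (sketch).

    policy_logps, ref_logps : [K] summed log-probs of K responses to one prompt.
    labels                  : [K] graded relevance labels (higher = better),
                              e.g. human ratings or ranks mapped to scores.
    """
    # Scores: scaled policy-vs-reference log-ratios, as in DPO/SLiC.
    s = beta * (policy_logps - ref_logps)                        # [K]

    # DCG-style gains from the labels, and rank discounts from the
    # current ordering of the scores ("dynamic permutation").
    gains = torch.pow(2.0, labels) - 1.0                         # [K]
    ranks = torch.empty_like(s)
    ranks[torch.argsort(s, descending=True)] = torch.arange(
        1, s.numel() + 1, dtype=s.dtype)                         # 1-based rank of each item
    discounts = 1.0 / torch.log2(ranks + 1.0)                    # [K]

    # Lambda weight for each pair (i, j): |gain gap| * |discount gap|.
    lambda_w = torch.abs(gains[:, None] - gains[None, :]) * \
               torch.abs(discounts[:, None] - discounts[None, :])  # [K, K]

    # Pairwise logistic terms on score differences, weighted by lambda_w,
    # summed over pairs where item i is labeled strictly better than item j.
    score_diff = s[:, None] - s[None, :]                         # [K, K]
    better = (labels[:, None] > labels[None, :]).float()         # [K, K]
    loss_terms = -F.logsigmoid(score_diff) * lambda_w * better
    return loss_terms.sum()

# Toy usage: four responses to one prompt with graded labels.
loss = lipo_lambda_loss(
    policy_logps=torch.tensor([-10.0, -12.0, -11.0, -15.0]),
    ref_logps=torch.tensor([-11.0, -11.5, -11.0, -14.0]),
    labels=torch.tensor([3.0, 2.0, 1.0, 0.0]))
print(loss.item())
```

Note the design connection: with `lambda_w` fixed to 1 and a list of size K = 2, the sum reduces to the DPO-style pairwise logistic loss in the earlier sketch, which is the sense in which a listwise objective of this kind generalizes the pairwise ones.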