28 May 2024 | Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Ávila Pires and Bilal Piot
This paper introduces Generalized Preference Optimization (GPO), a unified framework for offline preference optimization that recovers existing methods such as DPO, IPO, and SLiC as special cases. GPO parameterizes preference optimization losses with a family of convex functions, giving a single view of offline alignment algorithms and clarifying how the choice of convex function shapes the regularization that offline methods enforce. The paper analyzes the connections and differences between this offline regularization and KL-divergence regularization, showing that different GPO variants achieve similar trade-offs between regularization and performance, although their optimal hyper-parameters may differ. Experiments confirm that GPO variants perform similarly across tasks, with regularization strength depending on the convex function used. The paper also highlights the difficulty of enforcing KL-divergence constraints with purely offline losses and offers empirical insights into the regularization-performance trade-off across GPO variants. Overall, GPO provides new algorithmic tools and empirical guidance for alignment practitioners.
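For concreteness, here is a minimal sketch (not the authors' code) of how the GPO loss family can be written. It assumes the per-pair log-probability ratio rho = log pi_theta(y_w|x)/pi_ref(y_w|x) - log pi_theta(y_l|x)/pi_ref(y_l|x) has already been computed from the policy and reference model; the exact scaling constants for each variant follow common conventions and may differ from the paper's.

```python
import torch

def gpo_loss(rho: torch.Tensor, beta: float, variant: str = "dpo") -> torch.Tensor:
    """GPO-style loss E[f(beta * rho)], where f is a convex function chosen per variant.

    rho: batch of log-probability-ratio differences for (chosen, rejected) pairs.
    beta: inverse regularization temperature (illustrative convention).
    """
    t = beta * rho
    if variant == "dpo":
        # Logistic log-loss, f(t) = log(1 + exp(-t)), computed stably via softplus.
        losses = torch.nn.functional.softplus(-t)
    elif variant == "ipo":
        # Squared loss, f(t) = (t - 1)^2 (constants absorbed into beta here).
        losses = (t - 1.0) ** 2
    elif variant == "slic":
        # Hinge loss, f(t) = max(0, 1 - t).
        losses = torch.clamp(1.0 - t, min=0.0)
    else:
        raise ValueError(f"unknown GPO variant: {variant}")
    return losses.mean()

# Same preference data, different convex functions: the variants differ mainly
# in how strongly they regularize, not in the overall shape of the objective.
rho = torch.tensor([0.5, -0.2, 1.3])
for v in ("dpo", "ipo", "slic"):
    print(v, gpo_loss(rho, beta=0.1, variant=v).item())
```

Swapping the convex function while keeping the rest of the pipeline fixed is exactly the kind of ablation the paper's experiments perform, with the caveat that each variant may need its own beta to reach a comparable regularization level.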