Generalized Preference Optimization: A Unified Approach to Offline Alignment


28 May 2024 | Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Ávila Pires and Bilal Piot
The paper introduces Generalized Preference Optimization (GPO), a unified framework for offline preference optimization that fine-tunes large models directly from offline preference data. GPO parameterizes offline losses through a general class of convex functions, recovering existing algorithms such as DPO, IPO, and SLiC as special cases. The framework clarifies how offline algorithms enforce regularization through the choice of the convex function that defines the loss. The authors analyze, and demonstrate experimentally, the connections and subtle differences between this offline regularization and the KL-divergence regularization intended by the canonical RLHF formulation. They show that different GPO variants achieve similar trade-offs between regularization and performance, with the optimal hyperparameter values shifting as predicted by theory. The results offer new algorithmic tools and empirical insights for alignment practitioners. The paper also discusses the challenges and limitations of the approach and suggests directions for future research.
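
To make the convex-loss parameterization concrete, here is a minimal Python/PyTorch sketch of a GPO-style loss, assuming the usual DPO-style log-ratio-difference parameterization; the function name gpo_loss, its arguments, and the exact placement of the scaling factor beta are illustrative assumptions rather than the paper's verbatim implementation. The sketch only shows how swapping the convex function f recovers DPO (logistic loss), IPO (squared loss), and SLiC (hinge loss) as special cases.

```python
import torch
import torch.nn.functional as F

def gpo_loss(policy_logps_w, policy_logps_l,
             ref_logps_w, ref_logps_l,
             beta=0.1, variant="dpo"):
    """Hedged sketch of a GPO-style preference loss for a batch of
    (chosen, rejected) response pairs.

    policy_logps_*: log pi_theta(y | x), summed over response tokens.
    ref_logps_*:    log pi_ref(y | x), summed over response tokens.
    """
    # Scaled difference of log-ratios between chosen and rejected responses.
    rho = beta * ((policy_logps_w - ref_logps_w)
                  - (policy_logps_l - ref_logps_l))

    if variant == "dpo":      # logistic log-loss: f(t) = log(1 + exp(-t))
        loss = F.softplus(-rho)
    elif variant == "ipo":    # squared loss: f(t) = (t - 1)^2
        loss = (rho - 1.0) ** 2
    elif variant == "slic":   # hinge loss: f(t) = max(0, 1 - t)
        loss = torch.clamp(1.0 - rho, min=0.0)
    else:
        raise ValueError(f"unknown GPO variant: {variant}")

    return loss.mean()
```

The different curvatures and tails of these convex functions are what produce the differing implicit regularization strengths the paper studies, which is why the best value of beta is expected to differ across variants.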