13 Mar 2024 | Daniele Calandriello*, Daniel Guo*, Rémi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot
This paper focuses on aligning the outputs of large language models with human preferences, a critical aspect of ensuring useful, safe, and pleasant user experiences. It makes two main contributions: first, it demonstrates the equivalence between Identity Policy Optimization (IPO) and Nash Mirror Descent (Nash-MD), two recent alignment methods; second, it proposes a generalization of IPO, named IPO-MD, which leverages the regularized sampling approach from Nash-MD.
The paper begins by reviewing existing methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), and Sequence Likelihood Calibration (SLiC). It then introduces IPO and Nash-MD, highlighting their differences in terms of offline/online data, contrastivity, and equilibria. The key finding is that, although IPO is an offline method and Nash-MD is an online method that relies on a preference model, their online versions are equivalent: both find the Nash equilibrium of the preference model through self-play.
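To make the offline side of this contrast concrete, below is a minimal sketch of the IPO objective as it is commonly written: the policy is trained to regress the log-likelihood-ratio margin between the preferred and dispreferred completions towards a constant determined by the regularization strength τ. The function name, tensor shapes, default value of tau, and the use of PyTorch are illustrative assumptions, not the paper's implementation.

```python
import torch

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """Illustrative offline IPO objective: regress the log-likelihood-ratio
    margin between the preferred (w) and dispreferred (l) completions
    towards the constant 1 / (2 * tau).

    All inputs are per-sequence log-probabilities of shape (batch,).
    """
    # h(y_w, y_l, x) = log[pi(y_w|x)/pi_ref(y_w|x)] - log[pi(y_l|x)/pi_ref(y_l|x)]
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # Squared regression towards the target margin set by the regularization strength tau.
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()
```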
The authors propose Online IPO, an online variant of IPO, and IPO-MD, which combines the strengths of both methods. They analyze the theoretical properties of these algorithms, showing that Online IPO's stationary points coincide with the Nash equilibrium of the preference game optimized by Nash-MD-PG. Empirical results on a summarization task demonstrate that IPO-MD and Online IPO are robust and outperform other methods, indicating their potential for large-scale preference optimization.
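A distinguishing ingredient of IPO-MD is where the training pairs come from: rather than a fixed offline dataset, completions are drawn from a geometric mixture of the current policy and the reference policy (the regularized sampling distribution used by Nash-MD) and ranked online by a preference model before the IPO loss is applied. The sketch below shows only the token-level mixture sampling step; the function name, the mixture coefficient beta and its default value, and the PyTorch framing are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def sample_from_geometric_mixture(policy_logits, ref_logits, beta=0.125, generator=None):
    """Draw the next token from a geometric mixture of the current policy and
    the reference policy, proportional to pi(y|x)**(1 - beta) * pi_ref(y|x)**beta.
    In log space this is simply a convex combination of the two distributions.

    policy_logits, ref_logits: unnormalized next-token logits, shape (batch, vocab).
    Returns sampled token ids of shape (batch,).
    """
    mixture_log_probs = ((1.0 - beta) * F.log_softmax(policy_logits, dim=-1)
                         + beta * F.log_softmax(ref_logits, dim=-1))
    probs = torch.softmax(mixture_log_probs, dim=-1)  # renormalize the mixture
    return torch.multinomial(probs, num_samples=1, generator=generator).squeeze(-1)
```

Pairs generated this way would then be ranked by a preference model and plugged into the IPO loss sketched above; setting beta to zero recovers sampling from the current policy alone.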