Human Alignment of Large Language Models through Online Preference Optimisation


13 Mar 2024 | Daniele Calandriello, Daniel Guo, Rémi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot
This paper studies human alignment of large language models (LLMs) through online preference optimisation. The authors establish an equivalence between two recent alignment methods, Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD), and introduce a generalisation of IPO, called IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. They show that the online version of IPO is equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, IPO-MD generates data with a mixture policy between the online and reference policies, in the spirit of the general Nash-MD algorithm. The authors compare Online IPO and IPO-MD to online versions of existing preference losses such as DPO and SLiC on a summarisation task.

The paper reviews a range of preference optimisation algorithms, including RLHF with a Bradley-Terry reward model, Direct Preference Optimisation (DPO), Sequence Likelihood Calibration (SLiC), Identity Policy Optimisation (IPO), and Nash-MD-PG, and highlights their differences in terms of contrastivity, offline versus online data, equilibria, and regularised sampling. The authors propose Online IPO, an online variant of IPO, and IPO-MD, which interpolates between the offline and online variants by using the lagged data distribution of Nash-MD-PG. They also provide an experimental suite contrasting these algorithms across several applications, with detailed comparisons between the proposed methods and several baselines.

The experiments show that IPO-MD and Online IPO are the most robust algorithms and are promising approaches to preference optimisation at scale. The work provides a theoretical bridge between IPO and Nash-MD-PG, demonstrates that the proposed algorithms are effective in aligning LLMs with human preferences, and highlights the importance of regularised sampling and the potential benefits of using online data for effective regularisation. In the results, IPO and IPO-MD are statistically indistinguishable in performance, and both consistently beat all other algorithms. The authors argue that summarisation is a good test bed for human alignment algorithms because it is a complex task that is in high demand.
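To make the notion of a contrastive preference loss concrete, here is a minimal sketch of the standard DPO and IPO objectives on a batch of preference pairs, assuming summed sequence log-probabilities under the current and reference policies are available. The tensor names and default hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO: logistic loss on the reference-normalised log-ratio gap
    # between the preferred (w) and dispreferred (l) completions.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    # IPO: squared regression of the same gap towards 1 / (2 * tau),
    # rather than pushing it through a sigmoid as DPO does.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (margin - 1.0 / (2.0 * tau)).pow(2).mean()
```

In the online variants discussed in the paper, the preference pairs are drawn from the current policy (or a regularised version of it) and labelled by a preference model, rather than taken from a fixed offline dataset.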
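IPO-MD differs from Online IPO mainly in how training data are generated: completions are sampled from a geometric mixture of the online policy and the reference policy, the regularised sampling inherited from Nash-MD. The sketch below illustrates one token-level way to do this, assuming `policy` and `ref_policy` map a batch of token ids to next-token logits; the interface and the value of `beta` are assumptions for illustration, not the authors' code.

```python
import torch

@torch.no_grad()
def sample_mixture(policy, ref_policy, prompt_ids, beta=0.125,
                   max_new_tokens=128, eos_id=None):
    """Sample a completion from pi_mix ∝ policy^(1-beta) * ref_policy^beta."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = policy(ids)[:, -1, :]          # online policy next-token logits
        ref_logits = ref_policy(ids)[:, -1, :]  # frozen reference next-token logits
        # Interpolating the logits and re-normalising yields the per-token
        # geometric mixture of the two next-token distributions.
        mixed = (1.0 - beta) * logits + beta * ref_logits
        probs = torch.softmax(mixed, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
        if eos_id is not None and (next_id == eos_id).all():
            break
    return ids
```

Setting `beta=0` recovers sampling from the online policy alone (the Online IPO regime), while larger `beta` pulls the sampling distribution towards the reference policy.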