14 Jun 2024 | Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, Quanquan Gu
Self-Play Preference Optimization (SPPO) is a novel method for language model alignment that treats the problem as a constant-sum two-player game and aims to identify the Nash equilibrium policy. Unlike traditional reinforcement learning from human feedback (RLHF) methods that rely on parametric models such as the Bradley-Terry model, SPPO works directly with preference probabilities, which reflect human preferences more faithfully and allow more flexible alignment. SPPO approximates the Nash equilibrium through iterative policy updates and comes with a theoretical convergence guarantee. In practice it increases the log-likelihood of the chosen response while decreasing that of the rejected one, something that symmetric pairwise losses such as Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO) cannot trivially achieve.
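To make the update concrete, here is a minimal sketch (not the authors' released code) of the per-example squared loss described in the paper: the log-density ratio between the current policy and the previous iterate is regressed toward η(P̂(y ≻ π_t | x) − 1/2), so responses the preference model favors get their likelihood pushed up and disfavored ones pushed down. The function name, argument names, and the default η below are illustrative assumptions.

```python
import torch

def sppo_loss(logprob_theta: torch.Tensor,
              logprob_ref: torch.Tensor,
              pref_prob: torch.Tensor,
              eta: float = 1.0) -> torch.Tensor:
    """SPPO-style squared loss (a minimal sketch, not the authors' implementation).

    logprob_theta: log pi_theta(y|x), summed over response tokens, shape (batch,)
    logprob_ref:   log pi_t(y|x) under the previous-iteration (reference) policy
    pref_prob:     estimated P(y beats the current policy | x), in [0, 1]
    eta:           step-size-like hyperparameter from the iterative update
    """
    log_ratio = logprob_theta - logprob_ref
    # Regress the log-density ratio toward eta * (P_hat - 1/2):
    # favored responses (P_hat > 1/2) are pushed up, disfavored ones pushed down.
    target = eta * (pref_prob - 0.5)
    return ((log_ratio - target) ** 2).mean()
```

With P̂ close to 1 for a clearly preferred response and close to 0 for a clearly dispreferred one, the regression targets become roughly +η/2 and −η/2, which is how SPPO raises the chosen response's log-likelihood and lowers the rejected one's rather than only widening the gap between them.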
In experiments, SPPO achieves state-of-the-art results on AlpacaEval 2.0, outperforming DPO and IPO on multiple benchmarks. Using only 60k prompts from the UltraFeedback dataset and a pre-trained preference model with 0.4B parameters, SPPO fine-tunes Mistral-7B-Instruct-v0.2 to achieve a length-controlled win rate of 28.53% against GPT-4-Turbo. Starting from a stronger base model, Llama-3-8B-Instruct, SPPO achieves a length-controlled win rate of 38.77%. Notably, SPPO's strong performance is achieved without external supervision from stronger language models like GPT-4.
SPPO is designed to handle the complexity of human preferences, which can be non-transitive and irrational. It uses a self-play mechanism where the policy is fine-tuned on synthetic data generated by itself, allowing for more effective alignment. The method is theoretically grounded and has been shown to converge to an approximate Nash equilibrium. SPPO's performance is consistent across different tasks and benchmarks, demonstrating its effectiveness in aligning language models with human preferences. The algorithm is implemented with a focus on efficiency and scalability, making it suitable for large-scale language model alignment.
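To illustrate the self-play loop, below is a rough sketch of one data-generation round, assuming `generate` samples responses from the current policy and `prefer` wraps a pairwise preference model (both are hypothetical stand-ins, not the authors' API). Each sampled response's estimated win rate against the policy itself serves as P̂ in the squared loss above, approximating the exponential-weights update π_{t+1}(y|x) ∝ π_t(y|x)·exp(η·P(y ≻ π_t | x)) that the paper analyzes.

```python
from typing import Callable, Dict, List

def sppo_self_play_round(prompts: List[str],
                         generate: Callable[[str, int], List[str]],
                         prefer: Callable[[str, str, str], float],
                         k: int = 5) -> List[Dict]:
    """One round of SPPO-style self-play data generation (a rough sketch).

    generate(x, k) samples k responses to prompt x from the current policy;
    prefer(x, y_a, y_b) returns a preference model's estimate of P(y_a beats y_b | x).
    """
    dataset = []
    for x in prompts:
        ys = generate(x, k)  # synthetic responses sampled from the policy itself
        candidates = []
        for i, y_i in enumerate(ys):
            # Monte Carlo estimate of P(y_i beats pi_t | x): average preference
            # of y_i over the other responses drawn from the same policy.
            scores = [prefer(x, y_i, y_j) for j, y_j in enumerate(ys) if j != i]
            candidates.append({"response": y_i, "win_rate": sum(scores) / len(scores)})
        dataset.append({"prompt": x, "candidates": candidates})
    return dataset
```

The resulting prompt/response/win-rate records are then used to fine-tune the policy with the squared loss sketched earlier, after which the next round samples from the updated policy, repeating until the iterates approach the Nash equilibrium.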