14 Jun 2024 | Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, Quanquan Gu
Self-Play Preference Optimization (SPPO) is a novel method for language model alignment that treats the problem as a constant-sum two-player game and aims to identify the Nash equilibrium policy. Unlike traditional reinforcement learning from human feedback (RLHF) methods that rely on parametric models such as the Bradley-Terry model, SPPO works directly with preference probabilities, which reflect human preferences more faithfully and allow more flexible alignment. SPPO approximates the Nash equilibrium through iterative policy updates and comes with a theoretical convergence guarantee. In practice it increases the log-likelihood of the chosen response while decreasing that of the rejected one, something that symmetric pairwise losses such as Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO) cannot trivially achieve.
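To make the update concrete, here is a minimal sketch (not the authors' released code) of the per-example squared loss described in the paper: the log-density ratio between the current policy and the previous iterate is regressed toward η(P̂(y ≻ π_t | x) − 1/2), so responses the preference model favors get their likelihood pushed up and disfavored ones pushed down. The function name, argument names, and the default η below are illustrative assumptions.

```python
import torch

def sppo_loss(logprob_theta: torch.Tensor,
              logprob_ref: torch.Tensor,
              pref_prob: torch.Tensor,
              eta: float = 1.0) -> torch.Tensor:
    """SPPO-style squared loss (a minimal sketch, not the authors' implementation).

    logprob_theta: log pi_theta(y|x), summed over response tokens, shape (batch,)
    logprob_ref:   log pi_t(y|x) under the previous-iteration (reference) policy
    pref_prob:     estimated P(y beats the current policy | x), in [0, 1]
    eta:           step-size-like hyperparameter from the iterative update
    """
    log_ratio = logprob_theta - logprob_ref
    # Regress the log-density ratio toward eta * (P_hat - 1/2):
    # favored responses (P_hat > 1/2) are pushed up, disfavored ones pushed down.
    target = eta * (pref_prob - 0.5)
    return ((log_ratio - target) ** 2).mean()
```

With P̂ close to 1 for a clearly preferred response and close to 0 for a clearly dispreferred one, the regression targets become roughly +η/2 and −η/2, which is how SPPO raises the chosen response's log-likelihood and lowers the rejected one's rather than only widening the gap between them.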
In experiments, SPPO achieves state-of-the-art results on AlpacaEval 2.0, outperforming DPO and IPO on multiple benchmarks. Using only 60k prompts from the UltraFeedback dataset and a pre-trained preference model with 0.4B parameters, SPPO fine-tunes Mistral-7B-Instruct-v0.2 to achieve a length-controlled win rate of 28.53% against GPT-4-Turbo. Starting from a stronger base model, Llama-3-8B-Instruct, SPPO achieves a length-controlled win rate of 38.77%. Notably, SPPO's strong performance is achieved without external supervision from stronger language models like GPT-4.
SPPO is designed to handle the complexity of human preferences, which can be non-transitive and irrational. It uses a self-play mechanism where the policy is fine-tuned on synthetic data generated by itself, allowing for more effective alignment. The method is theoretically grounded and has been shown to converge to an approximate Nash equilibrium. SPPO's performance is consistent across different tasks and benchmarks, demonstrating its effectiveness in aligning language models with human preferences. The algorithm is implemented with a focus on efficiency and scalability, making it suitable for large-scale language model alignment.
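To illustrate the self-play loop, below is a rough sketch of one data-generation round, assuming `generate` samples responses from the current policy and `prefer` wraps a pairwise preference model (both are hypothetical stand-ins, not the authors' API). Each sampled response's estimated win rate against the policy itself serves as P̂ in the squared loss above, approximating the exponential-weights update π_{t+1}(y|x) ∝ π_t(y|x)·exp(η·P(y ≻ π_t | x)) that the paper analyzes.

```python
from typing import Callable, Dict, List

def sppo_self_play_round(prompts: List[str],
                         generate: Callable[[str, int], List[str]],
                         prefer: Callable[[str, str, str], float],
                         k: int = 5) -> List[Dict]:
    """One round of SPPO-style self-play data generation (a rough sketch).

    generate(x, k) samples k responses to prompt x from the current policy;
    prefer(x, y_a, y_b) returns a preference model's estimate of P(y_a beats y_b | x).
    """
    dataset = []
    for x in prompts:
        ys = generate(x, k)  # synthetic responses sampled from the policy itself
        candidates = []
        for i, y_i in enumerate(ys):
            # Monte Carlo estimate of P(y_i beats pi_t | x): average preference
            # of y_i over the other responses drawn from the same policy.
            scores = [prefer(x, y_i, y_j) for j, y_j in enumerate(ys) if j != i]
            candidates.append({"response": y_i, "win_rate": sum(scores) / len(scores)})
        dataset.append({"prompt": x, "candidates": candidates})
    return dataset
```

The resulting prompt/response/win-rate records are then used to fine-tune the policy with the squared loss sketched earlier, after which the next round samples from the updated policy, repeating until the iterates approach the Nash equilibrium.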