Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

2024-07-07 | Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu
This paper studies the alignment of large language models (LLMs) with human preferences through reinforcement learning from human feedback (RLHF), viewed from a game-theoretic perspective. Rather than relying on reward-based approaches under the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences, the paper formulates alignment as a two-player game and proposes Iterative Nash Policy Optimization (INPO). INPO approximates the Nash policy via no-regret learning, with the policy playing against itself. Unlike previous methods that require estimating the expected win rate of individual responses, INPO introduces a new loss objective that is minimized directly over a preference dataset. The paper provides a theoretical analysis and demonstrates INPO's effectiveness on several benchmarks. Starting from a LLaMA-3-8B-based SFT model, INPO achieves a 41.5% length-controlled win rate on AlpacaEval 2.0 and a 38.3% win rate on Arena-Hard, a significant improvement over state-of-the-art iterative algorithms built on the BT model assumption. An ablation study further highlights the benefit of incorporating KL regularization for controlling response length.
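To make the idea of "a loss minimized directly over a preference dataset" concrete, the sketch below shows an IPO-style squared loss on the margin of log-probability ratios between the current policy and the previous self-play iterate. This is a rough, hypothetical illustration of that class of objective, not the paper's exact INPO loss; the function name preference_loss, the hyperparameter eta, and the target value are all assumptions made for the example.

```python
# Hypothetical sketch: an IPO-style squared preference loss minimized directly
# over an offline preference dataset, using the previous self-play iterate pi_t
# as the reference policy. Not the paper's exact INPO objective.
import torch

def preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, eta=0.1):
    """Squared loss on the margin of log-probability ratios.

    logp_w / logp_l:         log pi_theta(y_w|x), log pi_theta(y_l|x)
    ref_logp_w / ref_logp_l: log pi_t(y_w|x),     log pi_t(y_l|x)  (previous iterate)
    eta:                     step-size-like hyperparameter (illustrative value)
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    target = 1.0 / (2.0 * eta)  # pull the preferred-vs-dispreferred margin toward a fixed target
    return ((margin - target) ** 2).mean()

# Toy usage with random sequence-level log-probabilities.
batch = 4
logp_w = torch.randn(batch, requires_grad=True)
logp_l = torch.randn(batch, requires_grad=True)
ref_w, ref_l = torch.randn(batch), torch.randn(batch)

loss = preference_loss(logp_w, logp_l, ref_w, ref_l)
loss.backward()
print(float(loss))
```

In an iterative setup of this kind, the reference log-probabilities would come from the policy produced in the previous round, so each round regresses the new policy's preference margins against data generated by self-play.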