Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning


7 Jul 2024 | Yuheng Zhang¹²*, Dian Yu², Baolin Peng², Linfeng Song², Ye Tian², Mingyue Huo¹, Nan Jiang¹, Haitao Mi², Dong Yu²
This paper proposes a novel algorithm called Iterative Nash Policy Optimization (INPO) for aligning large language models (LLMs) with human preferences through no-regret learning. Unlike previous reward-based methods that rely on the Bradley-Terry (BT) model assumption, INPO formulates the problem as a two-player game and uses no-regret learning to approximate the Nash policy. The key idea is to let the policy play against itself to improve its performance, avoiding the need to estimate the expected win rate for individual responses. Instead, INPO introduces a new loss objective that is directly minimized over a preference dataset. Theoretical analysis shows that the minimizer of this loss corresponds to the target policy. Experiments on benchmarks like AlpacaEval 2.0 and Arena-Hard demonstrate that INPO achieves significant improvements over the state-of-the-art iterative algorithm under the BT model assumption. Additionally, an ablation study highlights the benefits of incorporating KL regularization for response length control. The algorithm is designed to learn the Nash policy without requiring access to the expected win rate, making it more efficient and practical for real-world applications.
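To make the "loss minimized directly over a preference dataset" idea concrete, here is a minimal sketch in PyTorch. It is not the paper's exact INPO objective (which also involves the previous-iteration policy and an online preference oracle over self-play samples); it shows a related IPO-style squared loss on (chosen, rejected) pairs with log-ratio regularization toward a reference policy. The function name, the target constant 1/(2·eta), and the value of eta are illustrative assumptions.

```python
# Sketch only: an IPO-style squared preference loss, assumed here to illustrate
# the kind of objective that is minimized directly on preference pairs without
# fitting a Bradley-Terry reward model or estimating per-response win rates.
import torch

def squared_preference_loss(
    logp_chosen: torch.Tensor,       # log pi_theta(y_w | x), shape (batch,)
    logp_rejected: torch.Tensor,     # log pi_theta(y_l | x), shape (batch,)
    ref_logp_chosen: torch.Tensor,   # log pi_ref(y_w | x)
    ref_logp_rejected: torch.Tensor, # log pi_ref(y_l | x)
    eta: float = 0.1,                # hypothetical step-size / regularization knob
) -> torch.Tensor:
    # Margin of log-probability ratios between chosen and rejected responses,
    # measured relative to the reference policy (this is the KL-style anchor).
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Regress the margin toward a fixed target 1/(2*eta): a squared loss on
    # preference pairs, with no explicit reward model in the loop.
    return ((margin - 1.0 / (2.0 * eta)) ** 2).mean()

# Toy usage with random sequence-level log-probabilities.
batch = 4
policy_chosen = torch.randn(batch, requires_grad=True)
policy_rejected = torch.randn(batch, requires_grad=True)
ref_chosen = torch.randn(batch)
ref_rejected = torch.randn(batch)
loss = squared_preference_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()  # gradients flow into the policy log-probabilities
```

In the iterative self-play setting described above, the current policy would generate response pairs, a preference annotator would label them, and a loss of this general shape would be minimized to produce the next policy iterate.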