Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences

4 Apr 2024 | Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, Tengyang Xie
This paper introduces Direct Nash Optimization (DNO), a novel approach to post-training large language models (LLMs) using preference feedback from a powerful oracle. DNO improves LLMs iteratively by optimizing over general preferences, addressing the limitations of traditional reward-based reinforcement learning from human feedback (RLHF). Unlike RLHF, which often relies on point-wise rewards that cannot express complex intransitive or cyclic preferences, DNO directly optimizes over pair-wise, or general, preferences. DNO is designed as a batched on-policy algorithm with a regression-based objective, making it straightforward and efficient to implement.

The key contributions of DNO include:

1. **Theoretical soundness**: DNO provably converges to the intended Nash equilibrium and guarantees monotonic improvement across iterations.
2. **Scalability**: DNO is efficient and scalable, striking a balance between deployment efficiency and adaptability.
3. **Empirical performance**: DNO achieves state-of-the-art results on various benchmarks, outperforming models with more parameters and older versions of GPT-4.

The paper also includes a thorough ablation study of critical design choices, such as the choice of loss function, training paradigm, preference-annotator quality, and training-pair construction. These findings highlight the importance of carefully crafted methods in achieving substantial gains. Overall, DNO provides a promising solution for post-training LLMs and offers actionable insights for the AI research community.
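To make the "regression-based objective" concrete, the sketch below shows one common way such an objective can be written in PyTorch: a binary-cross-entropy loss on the log-probability-ratio margin between the response an oracle preferred and the one it rejected, with the previous iteration's policy as the reference. This is a minimal illustration under stated assumptions; the helper name `dno_pair_loss`, the toy inputs, and the pair-construction comments are this sketch's own, not the authors' released implementation.

```python
# Minimal sketch (assumption, not the authors' code) of a regression-style
# preference loss of the kind DNO's batched on-policy iterations could use.
# In each iteration, responses are sampled from the current policy, annotated
# by a preference oracle, and formed into (chosen, rejected) pairs; the loss
# below is then minimized against the previous iteration as reference.
import torch
import torch.nn.functional as F

def dno_pair_loss(policy_logps_chosen, policy_logps_rejected,
                  ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """Binary cross-entropy on the log-ratio margin of a preference pair.

    Each argument is a tensor of summed token log-probabilities for a batch
    of responses under the current policy or the previous-iteration reference.
    """
    chosen_ratio = policy_logps_chosen - ref_logps_chosen
    rejected_ratio = policy_logps_rejected - ref_logps_rejected
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(logits): widen the margin between the oracle-preferred
    # response and the rejected one, relative to the reference policy.
    return -F.logsigmoid(logits).mean()

# Toy usage: random log-probabilities standing in for real model outputs.
if __name__ == "__main__":
    torch.manual_seed(0)
    batch = 4
    loss = dno_pair_loss(torch.randn(batch), torch.randn(batch),
                         torch.randn(batch), torch.randn(batch))
    print(f"pair loss: {loss.item():.4f}")
```

In an actual training loop, the log-probabilities would come from scoring on-policy samples with the current and reference models, and only pairs whose oracle preference exceeds a margin would be kept, consistent with the paper's emphasis on careful training-pair construction.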