4 Apr 2024 | Corby Rosset*, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah*, Tengyang Xie*
This paper introduces Direct Nash Optimization (DNO), a novel algorithm for post-training large language models (LLMs) using preference feedback from a powerful oracle. DNO combines the simplicity and stability of contrastive learning with the theoretical generality of optimizing general preferences. It is a batched on-policy algorithm that uses a regression-based objective, making it efficient and scalable. DNO ensures monotonic improvement across iterations, allowing it to improve even over strong teachers like GPT-4. In experiments, a 7B-parameter Orca-2.5 model aligned with DNO achieves a 33% win rate against GPT-4-Turbo on AlpacaEval 2.0, an absolute gain of 26% over the initial model. It outperforms models with far more parameters, including Mistral Large, Self-Rewarding LM (70B parameters), and older versions of GPT-4. Ablation studies show that careful design choices, such as preference pair selection and using LLMs as preference annotators, lead to significant improvements. DNO is theoretically proven to converge to the Nash equilibrium and to improve monotonically. It is also practically efficient, with a scalable implementation that uses contrastive updates and off-policy samples from a powerful teacher. The results demonstrate the effectiveness of DNO in post-training LLMs, offering actionable insights for AI research.
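To make the "contrastive, regression-based objective" concrete, here is a minimal sketch of the pairwise loss applied in each batched DNO iteration, written in the spirit of a DPO-style update. This is not the authors' implementation: the function name `contrastive_pair_loss`, the choice of `beta=0.1`, and the toy log-probability values are all illustrative assumptions. In the real method, the pairs come from on-policy samples (plus off-policy samples from a strong teacher) annotated by the preference oracle, and the gradients flow into the LLM being trained.

```python
# Sketch of the contrastive pairwise loss used inside one DNO-style iteration.
# All tensors below are toy placeholders standing in for summed token
# log-probabilities of preferred (w) and dispreferred (l) responses.

import torch
import torch.nn.functional as F

def contrastive_pair_loss(policy_logp_w, policy_logp_l,
                          ref_logp_w, ref_logp_l, beta=0.1):
    """Push the policy's log-probability ratio for the preferred response
    above that of the dispreferred one, measured relative to the previous
    iteration's policy as the reference."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Toy batch of two preference pairs (hypothetical numbers).
policy_logp_w = torch.tensor([-12.3, -9.8], requires_grad=True)
policy_logp_l = torch.tensor([-11.1, -10.5], requires_grad=True)
ref_logp_w = torch.tensor([-12.0, -10.0])
ref_logp_l = torch.tensor([-11.5, -10.2])

loss = contrastive_pair_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l)
loss.backward()  # in a real run, gradients update the LLM policy
print(float(loss))
```

In the batched on-policy loop described by the paper, this update is repeated over iterations: sample responses from the current policy, have the oracle (e.g., an LLM annotator) select preferred/dispreferred pairs, fit the contrastive objective against the previous policy, and use the resulting model as the next iteration's starting point.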