2025 | Alexey Gorbatovskiy, Boris Shaposhnikov*, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, Daniil Gavrilov
This paper introduces Trust Region (TR) alignment methods, TR-DPO, TR-IPO, and TR-KTO, which dynamically update the reference policy during training to address reward overoptimization in offline Large Language Model (LLM) alignment. Classical offline alignment methods such as DPO, IPO, and KTO are susceptible to overoptimization: the model drifts too far from the fixed reference policy and sample quality degrades. The TR approach mitigates this with soft updates (a weighted interpolation between the current policy and the reference) or hard updates (periodically copying the policy weights into the reference), allowing the model to retain strong performance even as it moves away from the initial reference policy.
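To make the mechanism concrete, here is a minimal PyTorch sketch of the two reference-update rules, assuming a soft-update weight `alpha` and a hard-update interval `tau` as in the paper's notation; the surrounding training-loop names (`dpo_loss`, `loader`, `optimizer`) are hypothetical placeholders, not the authors' code.

```python
import torch


@torch.no_grad()
def soft_update(policy: torch.nn.Module, reference: torch.nn.Module, alpha: float) -> None:
    """Soft (EMA-style) update: ref <- alpha * policy + (1 - alpha) * ref."""
    for p_ref, p_pol in zip(reference.parameters(), policy.parameters()):
        p_ref.mul_(1.0 - alpha).add_(p_pol, alpha=alpha)


@torch.no_grad()
def hard_update(policy: torch.nn.Module, reference: torch.nn.Module) -> None:
    """Hard update: copy the current policy weights into the reference."""
    reference.load_state_dict(policy.state_dict())


# Hypothetical DPO-style training loop with a TR reference update:
# for step, batch in enumerate(loader):
#     loss = dpo_loss(policy, reference, batch)  # loss is computed against the *current* reference
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
#     soft_update(policy, reference, alpha=0.01)                 # TR soft variant (every step)
#     # or: if step % tau == 0: hard_update(policy, reference)   # TR hard variant (every tau steps)
```

In both variants the reference tracks the policy, so the implicit KL penalty constrains each step to a trust region around recent weights rather than around the frozen initial model.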
The TR methods are evaluated on task-specific datasets, Anthropic-HH and Reddit TL;DR, as well as on the general benchmarks AlpacaEval 2 and Arena-Hard. Across these, TR methods outperform their base counterparts on win rates and Human-Centric (HC) metrics. On AlpacaEval 2, for example, TR-DPO, TR-IPO, and TR-KTO improve over the classical methods by 9.5, 15.1, and 2.3 points, respectively.
The effectiveness of TR methods is supported by analysis of KL divergence and HC metrics: TR methods sustain higher quality than their classical counterparts even as they diverge from the initial reference policy. The TR approach also curbs overoptimization by preventing the model from assigning ever-higher probabilities to out-of-domain (OOD) data, improving both alignment and generation diversity.
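As a reference point for what is being measured, the sketch below computes the exact per-token KL between the trained policy and the frozen initial reference over response tokens; tensor shapes and names are assumptions, and in practice such divergence is often estimated from sampled tokens rather than the full vocabulary.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def sequence_kl(policy_logits: torch.Tensor,
                reference_logits: torch.Tensor,
                response_mask: torch.Tensor) -> torch.Tensor:
    """KL(policy || initial reference) per sequence, summed over response tokens.

    policy_logits, reference_logits: (batch, seq_len, vocab_size)
    response_mask: (batch, seq_len), 1 on response tokens, 0 on prompt/padding.
    """
    log_p = F.log_softmax(policy_logits, dim=-1)
    log_q = F.log_softmax(reference_logits, dim=-1)
    token_kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)  # (batch, seq_len)
    return (token_kl * response_mask).sum(dim=-1)           # (batch,)
```

Plotting quality metrics against this quantity is how the paper's claim reads: at a given KL budget from the initial reference, TR-trained policies score higher than their classical counterparts.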
The paper also discusses the limitations of TR methods, including the need for further research on generalization to other domains and modalities. Overall, the TR methods demonstrate the importance of incorporating reference policy updates to enhance training dynamics and improve alignment performance in LLMs.