2025 | Alexey Gorbatovskiy, Boris Shaposhnikov*, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, Daniil Gavrilov
This paper introduces Trust Region (TR) alignment methods, TR-DPO, TR-IPO, and TR-KTO, which dynamically update the reference policy during training to address reward overoptimization in offline Large Language Model (LLM) alignment. Classical offline alignment methods such as DPO, IPO, and KTO are susceptible to overoptimization: the model drifts too far from the fixed reference policy and sample quality degrades. The TR approach mitigates this with soft updates (a weighted interpolation between the current policy and the reference) or hard updates (periodically copying the policy weights into the reference), allowing the model to retain strong performance even as it moves away from the initial reference policy.
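To make the mechanism concrete, here is a minimal PyTorch sketch of the two reference-update rules, assuming a soft-update weight `alpha` and a hard-update interval `tau` as in the paper's notation; the surrounding training-loop names (`dpo_loss`, `loader`, `optimizer`) are hypothetical placeholders, not the authors' code.

```python
import torch


@torch.no_grad()
def soft_update(policy: torch.nn.Module, reference: torch.nn.Module, alpha: float) -> None:
    """Soft (EMA-style) update: ref <- alpha * policy + (1 - alpha) * ref."""
    for p_ref, p_pol in zip(reference.parameters(), policy.parameters()):
        p_ref.mul_(1.0 - alpha).add_(p_pol, alpha=alpha)


@torch.no_grad()
def hard_update(policy: torch.nn.Module, reference: torch.nn.Module) -> None:
    """Hard update: copy the current policy weights into the reference."""
    reference.load_state_dict(policy.state_dict())


# Hypothetical DPO-style training loop with a TR reference update:
# for step, batch in enumerate(loader):
#     loss = dpo_loss(policy, reference, batch)  # loss is computed against the *current* reference
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
#     soft_update(policy, reference, alpha=0.01)                 # TR soft variant (every step)
#     # or: if step % tau == 0: hard_update(policy, reference)   # TR hard variant (every tau steps)
```

In both variants the reference tracks the policy, so the implicit KL penalty constrains each step to a trust region around recent weights rather than around the frozen initial model.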
The TR methods are evaluated on task-specific datasets, Anthropic-HH and Reddit TL;DR, as well as on the general benchmarks AlpacaEval 2 and Arena-Hard. Across these, TR methods outperform their base counterparts on win rates and Human-Centric (HC) metrics. On AlpacaEval 2, for example, TR-DPO, TR-IPO, and TR-KTO improve over the classical methods by 9.5, 15.1, and 2.3 points, respectively.
The effectiveness of TR methods is supported by analysis of KL divergence and HC metrics: TR methods sustain higher quality than their classical counterparts even as they diverge from the initial reference policy. The TR approach also curbs overoptimization by preventing the model from assigning ever-higher probabilities to out-of-domain (OOD) data, improving both alignment and generation diversity.
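As a reference point for what is being measured, the sketch below computes the exact per-token KL between the trained policy and the frozen initial reference over response tokens; tensor shapes and names are assumptions, and in practice such divergence is often estimated from sampled tokens rather than the full vocabulary.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def sequence_kl(policy_logits: torch.Tensor,
                reference_logits: torch.Tensor,
                response_mask: torch.Tensor) -> torch.Tensor:
    """KL(policy || initial reference) per sequence, summed over response tokens.

    policy_logits, reference_logits: (batch, seq_len, vocab_size)
    response_mask: (batch, seq_len), 1 on response tokens, 0 on prompt/padding.
    """
    log_p = F.log_softmax(policy_logits, dim=-1)
    log_q = F.log_softmax(reference_logits, dim=-1)
    token_kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)  # (batch, seq_len)
    return (token_kl * response_mask).sum(dim=-1)           # (batch,)
```

Plotting quality metrics against this quantity is how the paper's claim reads: at a given KL budget from the initial reference, TR-trained policies score higher than their classical counterparts.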
The paper also discusses the limitations of TR methods, including the need for further research on generalization to other domains and modalities. Overall, the TR methods demonstrate the importance of incorporating reference policy updates to enhance training dynamics and improve alignment performance in LLMs.