29 May 2024 | Pierre Harvey Richemond*, Yunhao Tang*, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos*, Bilal Piot*
This paper introduces DRO (Direct Reward Optimisation), a new framework for aligning large language models (LLMs) using single-trajectory datasets. DRO is designed for the offline setting with human feedback and does not require pairwise preferences, unlike methods such as DPO or IPO. Instead, DRO uses a simple mean-squared objective that can be implemented in several ways. The framework is validated with T5 encoder-decoder language models and shows superior performance compared to baselines such as Kahneman-Tversky Optimization (KTO). DRO-V, a practical instantiation of DRO, combines offline policy learning with value function learning and outperforms KTO on the UltraFeedback dataset. The paper also discusses the theoretical foundations of DRO, including its relationship to policy optimisation and the role of the value function. Experimentally, DRO-V achieves notable gains, particularly in win rates against the SFT policy and the KTO baseline. The authors further study the impact of hyperparameters and architectural choices, highlighting the benefit of using separate networks for the policy and the value function. Overall, DRO is shown to be a simple and effective method for single-trajectory policy optimisation in the context of LLM alignment.
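
The summary above mentions a mean-squared objective and a DRO-V variant that jointly learns a policy and a value function. Below is a minimal sketch of how such a loss could look, assuming a squared residual that couples the observed reward, a value baseline V(x), and the β-scaled log-ratio of the policy to a frozen reference policy; the function and parameter names (`dro_v_loss`, `beta`) are illustrative and not taken from the authors' implementation.

```python
# Illustrative sketch (assumed form) of a single-trajectory mean-squared
# objective in the spirit of DRO-V: drive the residual between the reward,
# a learned value baseline V(x), and beta * log(pi/pi_ref) toward zero.
import torch


def dro_v_loss(
    policy_logprob: torch.Tensor,  # log pi_theta(y|x) for each (x, y) pair in the batch
    ref_logprob: torch.Tensor,     # log pi_ref(y|x) from the frozen SFT/reference model
    value: torch.Tensor,           # V_phi(x), predicted by a separate value network
    reward: torch.Tensor,          # scalar feedback observed for the trajectory
    beta: float = 0.1,             # KL-regularisation strength (hypothetical default)
) -> torch.Tensor:
    # Residual of the KL-regularised optimality condition: at the optimum,
    # beta * log(pi/pi_ref) should equal r(x, y) - V(x).
    residual = reward - value - beta * (policy_logprob - ref_logprob)
    # Squared-error objective averaged over the batch; the policy network and
    # the value network both receive gradients from the same term.
    return 0.5 * (residual ** 2).mean()


# Usage with dummy tensors (shapes only): one scalar reward per prompt-response pair.
batch = 4
loss = dro_v_loss(
    policy_logprob=torch.randn(batch, requires_grad=True),
    ref_logprob=torch.randn(batch),
    value=torch.randn(batch, requires_grad=True),
    reward=torch.randn(batch),
)
loss.backward()
```

Keeping the policy and value heads as separate networks, as the paper reports, means the two terms in the residual can be optimised without interfering through shared parameters; the sketch above mirrors that by treating `policy_logprob` and `value` as outputs of independent modules.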