29 May 2024 | Pierre Harvey Richemond*, Yunhao Tang*, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos*, Bilal Piot*
The paper introduces a new framework called Direct Reward Optimization (DRO) for aligning large language models (LLMs) using single-trajectory datasets, where each dataset element consists of a prompt, a response, and a scalar human-feedback signal (e.g., a thumbs-up/down rating). Unlike methods that rely on pairwise preference data, DRO optimizes a simple mean-squared objective that can be implemented in several ways. The authors validate DRO empirically with T5 encoder-decoder language models and show that it outperforms selected baselines, such as Kahneman-Tversky Optimization (KTO), on the *UltraFeedback* dataset. DRO operates fully offline, leveraging the abundance of single-trajectory data, and does not require learning an explicit reward model. The paper also discusses the theoretical properties of DRO, including the uniqueness of its optimum and the role of the value function, and provides practical implementation details. Empirical results show that DRO significantly outperforms KTO in side-by-side comparisons for both T5-L and T5-XL models. The authors further examine the impact of hyperparameters and architecture choices, showing that parameter sharing and the choice of value function significantly affect performance. Overall, DRO offers a simple and effective method for aligning LLMs with single-trajectory data, exploiting the scale and abundance of user feedback.
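
To make the mean-squared objective concrete, below is a minimal sketch (in PyTorch-style Python) of a DRO-style regression loss over single trajectories. The function name `dro_loss`, the argument names, and the exact residual form r(x, y) − V(x) − β·log(π(y|x)/π_ref(y|x)) are illustrative assumptions drawn from the description above (a KL-regularized policy, a learned value function, and a squared-error objective), not a verbatim transcription of the paper's implementation.

```python
import torch

def dro_loss(policy_logprobs, ref_logprobs, values, rewards, beta=1.0):
    """Illustrative single-trajectory DRO-style loss (assumed form).

    policy_logprobs: log pi_theta(y | x) for each (prompt, response) pair
    ref_logprobs:    log pi_ref(y | x) under a frozen reference policy
    values:          V_phi(x), a learned scalar value per prompt
    rewards:         observed scalar feedback r(x, y), e.g. +1 / -1 ratings
    beta:            KL-regularization strength
    """
    # Residual between the observed reward and the reward implied by the
    # KL-regularized policy: r(x, y) - V(x) - beta * log(pi / pi_ref).
    residual = rewards - values - beta * (policy_logprobs - ref_logprobs)
    # Mean-squared objective, minimized jointly over policy and value parameters.
    return 0.5 * (residual ** 2).mean()

# Example usage with dummy tensors for a batch of 4 single trajectories.
if __name__ == "__main__":
    batch = 4
    loss = dro_loss(
        policy_logprobs=torch.randn(batch, requires_grad=True),
        ref_logprobs=torch.randn(batch),
        values=torch.randn(batch, requires_grad=True),
        rewards=torch.tensor([1.0, -1.0, 1.0, -1.0]),  # thumbs-up/down feedback
    )
    loss.backward()
```

Because both the policy log-probabilities and the value estimates enter the same squared residual, a single gradient step updates both components jointly; this is also where the parameter-sharing choice discussed in the summary comes into play.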