29 May 2024 | Pierre Harvey Richemond*, Yunhao Tang*, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos*, Bilal Piot*
The paper introduces a new framework called Direct Reward Optimization (DRO) for aligning large language models (LLMs) using single-trajectory datasets, where each dataset element consists of a prompt, a response, and a scalar human-feedback signal (e.g., a thumbs-up/down rating). Unlike methods that rely on pairwise preference data, DRO optimizes a simple mean-squared objective that can be implemented in several ways. The authors validate DRO empirically with T5 encoder-decoder language models and show that it outperforms selected baselines, such as Kahneman-Tversky Optimization (KTO), on the *UltraFeedback* dataset. DRO operates fully offline, leveraging the abundance of single-trajectory data, and does not require learning an explicit reward model. The paper also discusses the theoretical properties of DRO, including the uniqueness of its optimum and the role of the value function, and provides practical implementation details. Empirical results show that DRO significantly outperforms KTO in side-by-side comparisons for both T5-L and T5-XL models. The authors further examine the impact of hyperparameters and architecture choices, showing that parameter sharing and the choice of value function significantly affect performance. Overall, DRO offers a simple and effective method for aligning LLMs with single-trajectory data, exploiting the scale and abundance of user feedback.
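
To make the mean-squared objective concrete, below is a minimal sketch (in PyTorch-style Python) of a DRO-style regression loss over single trajectories. The function name `dro_loss`, the argument names, and the exact residual form r(x, y) − V(x) − β·log(π(y|x)/π_ref(y|x)) are illustrative assumptions drawn from the description above (a KL-regularized policy, a learned value function, and a squared-error objective), not a verbatim transcription of the paper's implementation.

```python
import torch

def dro_loss(policy_logprobs, ref_logprobs, values, rewards, beta=1.0):
    """Illustrative single-trajectory DRO-style loss (assumed form).

    policy_logprobs: log pi_theta(y | x) for each (prompt, response) pair
    ref_logprobs:    log pi_ref(y | x) under a frozen reference policy
    values:          V_phi(x), a learned scalar value per prompt
    rewards:         observed scalar feedback r(x, y), e.g. +1 / -1 ratings
    beta:            KL-regularization strength
    """
    # Residual between the observed reward and the reward implied by the
    # KL-regularized policy: r(x, y) - V(x) - beta * log(pi / pi_ref).
    residual = rewards - values - beta * (policy_logprobs - ref_logprobs)
    # Mean-squared objective, minimized jointly over policy and value parameters.
    return 0.5 * (residual ** 2).mean()

# Example usage with dummy tensors for a batch of 4 single trajectories.
if __name__ == "__main__":
    batch = 4
    loss = dro_loss(
        policy_logprobs=torch.randn(batch, requires_grad=True),
        ref_logprobs=torch.randn(batch),
        values=torch.randn(batch, requires_grad=True),
        rewards=torch.tensor([1.0, -1.0, 1.0, -1.0]),  # thumbs-up/down feedback
    )
    loss.backward()
```

Because both the policy log-probabilities and the value estimates enter the same squared residual, a single gradient step updates both components jointly; this is also where the parameter-sharing choice discussed in the summary comes into play.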