Transforming and Combining Rewards for Aligning Large Language Models


19 Jul 2024 | Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, and Victor Veitch
This paper addresses the challenge of aligning large language models (LLMs) to human preferences, particularly via reinforcement learning from human feedback (RLHF). The authors study two key problems: transforming reward models to improve alignment, and combining multiple reward models that capture different properties. They propose a log-sigmoid-centered transformation (LSC-transformation) for rewards learned from Bradley-Terry preference models, which emphasizes improving poorly performing outputs and enables principled aggregation of multiple reward models. The LSC-transformation is derived from a probabilistic interpretation of the alignment procedure and has two main benefits: it mitigates underfitting and reward hacking, and it allows multiple reward models to be combined as a logical conjunction. Experiments show that the LSC-transformation yields substantial improvements in the quality of aligned models, in terms of both helpfulness and harmlessness. The paper also discusses the choice of reference rewards and the importance of uniform reward improvements.
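To make the construction concrete, below is a minimal Python/NumPy sketch (not code from the paper) of what a log-sigmoid-centered transform and conjunction-style aggregation could look like. The function names, reward values, and reference rewards in the example are hypothetical; in practice the reference reward would come from scoring a baseline output under each reward model.

```python
import numpy as np

def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)) = min(x, 0) - log(1 + exp(-|x|)).
    return np.minimum(x, 0.0) - np.log1p(np.exp(-np.abs(x)))

def lsc_transform(reward: float, reference_reward: float) -> float:
    """Log-sigmoid-centered transform of a Bradley-Terry reward.

    Interpretable as the log-probability that the output is preferred
    to a reference output under the Bradley-Terry model, so gains
    saturate for outputs that already beat the reference comfortably.
    """
    return log_sigmoid(reward - reference_reward)

def combine_rewards(rewards, reference_rewards) -> float:
    """Sum the transformed rewards across properties.

    Summing log-probabilities corresponds to a logical conjunction
    ("good on every property"), rather than letting a very high score
    on one property compensate for a poor score on another.
    """
    return sum(
        lsc_transform(r, r_ref)
        for r, r_ref in zip(rewards, reference_rewards)
    )

# Hypothetical helpfulness and harmlessness rewards for one (prompt, response) pair.
print(combine_rewards(rewards=[1.3, -0.2], reference_rewards=[0.5, 0.0]))
```

Under this sketch, raising a below-reference reward moves the combined objective much more than raising an already-high one, which is the "emphasize poorly performing outputs" behavior described above.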