2024 | Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, and Victor Veitch
This paper introduces the LSC-transformation (log-sigmoid-centered transformation) for aligning large language models (LLMs) with human preferences. Derived from a probabilistic interpretation of the alignment process, the transformation applies a log-sigmoid function to centered rewards, which emphasizes improving poorly-performing outputs rather than those that already score well and thereby mitigates reward hacking. The transformation also enables principled aggregation of rewards by linking summation to logical conjunction: the sum of transformed rewards corresponds to the log-probability that the output is "good" in all measured properties.
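As a rough sketch of the idea (not the authors' code), the transformation can be written as log σ(r(x, y) − r_ref(x)), where r_ref(x) is a prompt-specific centering value; the function name and the choice of how that baseline is supplied (e.g., estimated from reference-policy samples) are assumptions here:

```python
import numpy as np

def lsc_transform(reward: float, ref_reward: float) -> float:
    """Log-sigmoid-centered transformation of a raw reward.

    `ref_reward` is a prompt-specific centering value, assumed here to be
    supplied externally (e.g., estimated from reference-policy samples).
    """
    centered = reward - ref_reward
    # log sigmoid(x) = -log(1 + exp(-x)), computed stably via logaddexp
    return -np.logaddexp(0.0, -centered)

# Already-good outputs saturate toward 0, so there is little left to gain on
# them; poorly-scoring outputs keep a large negative value and therefore
# dominate the optimization pressure.
for r in (-3.0, 0.0, 3.0):
    print(r, lsc_transform(r, 0.0))
```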
The paper studies two key problems in aligning LLMs: (1) how to improve the alignment step by transforming the learned reward model, and (2) how to combine multiple reward models for different properties. The LSC-transformation is shown to alleviate both reward hacking and underfitting, and aligning to the transformed reward leads to substantial improvements in LLM performance. Experiments show that it yields better trade-offs between KL divergence from the reference policy and win rate in the single-reward setting, and improves alignment when combining multiple reward models. The transformation is also shown to reduce shortcuts in generated responses, such as formatting outputs as lists for helpfulness or defaulting to recommending professional help for harmlessness.
The paper also discusses reward aggregation for multiple objectives, showing that summing transformed rewards corresponds to logical AND, leading to better alignment performance. The LSC-transformation is shown to be effective in both single-reward and multi-reward scenarios, and is found to be more robust to reward overoptimization and underfitting. The paper concludes that the LSC-transformation provides a principled and effective method for aligning LLMs with human preferences.
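A similarly hedged sketch of the aggregation idea: if σ(r_i − ref_i) is read as the probability that the output is "good" on property i, then summing the transformed rewards gives the log-probability that it is good on all properties at once. The function and variable names below are illustrative, not from the paper:

```python
import numpy as np

def aggregate_lsc_rewards(rewards, ref_rewards):
    """Sum log-sigmoid-centered rewards across several reward models.

    Reading sigmoid(r_i - ref_i) as P(output is "good" on property i), the
    sum of transformed rewards is the log of the product of these
    probabilities, i.e. the log-probability that the output is good on
    every property (logical AND), assuming properties are judged
    independently.
    """
    centered = np.asarray(rewards, dtype=float) - np.asarray(ref_rewards, dtype=float)
    return float(np.sum(-np.logaddexp(0.0, -centered)))

# Naively summing raw rewards lets one very high score mask a failure on
# another property; the transformed sum cannot, because any near-zero
# probability drags the whole log-sum down.
print(aggregate_lsc_rewards([4.0, -4.0], [0.0, 0.0]))  # ~ -4.04, dominated by the failing property
print(aggregate_lsc_rewards([1.0, 1.0], [0.0, 0.0]))   # ~ -0.63, both properties moderately good
```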