ODIN: Disentangled Reward Mitigates Hacking in RLHF

11 Feb 2024 | Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro
This paper addresses reward hacking in Reinforcement Learning from Human Feedback (RLHF) for large language models (LLMs): the phenomenon where LLMs learn to generate verbose responses that score highly under the reward model at the expense of actual helpfulness. The authors propose a more reliable evaluation protocol for comparing training configurations, focusing on the trade-off between evaluation score and response length, and conduct large-scale studies of how RL hyperparameters and common tricks affect length bias.

To improve the reward model itself, they propose ODIN, which trains two linear heads on a shared feature representation to predict rewards: one head is correlated with length, while the other is trained to be decorrelated from length. The length head is discarded during RL, preventing the policy from hacking the reward through verbosity. Experiments demonstrate that ODIN significantly reduces the correlation between reward and response length, leading to improved policy performance. Evaluations with model-based metrics and human studies show that ODIN outperforms the baselines in both accuracy and length control.
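The two-headed design can be illustrated with a short PyTorch sketch. This is a simplified, assumed rendering of the idea rather than the authors' released code: the class and function names, the Pearson-correlation penalty, and the loss weights are illustrative assumptions. A shared feature vector feeds a quality head and a length head; the training objective ranks responses using the summed reward while pushing the length head to track response length and the quality head to be decorrelated from it.

```python
# Hedged sketch of a length-disentangled reward model: names, the Pearson
# penalty, and the loss weights are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def pearson_corr(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Pearson correlation between two 1-D tensors."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / (xc.norm() * yc.norm() + eps)


class DisentangledRewardHeads(nn.Module):
    """Two linear heads on top of a shared (pooled) LM feature vector."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.quality_head = nn.Linear(hidden_size, 1)  # kept for RL
        self.length_head = nn.Linear(hidden_size, 1)   # discarded before RL

    def forward(self, features: torch.Tensor):
        # features: (batch, hidden_size) pooled representation of prompt + response
        r_quality = self.quality_head(features).squeeze(-1)
        r_length = self.length_head(features).squeeze(-1)
        return r_quality, r_length


def disentangled_rm_loss(r_q, r_len, lengths, chosen_idx, rejected_idx,
                         lam_len=1.0, lam_decor=1.0):
    """Illustrative objective: rank preferred responses with the summed reward,
    make the length head track response length, and decorrelate the quality
    head from length."""
    r_sum = r_q + r_len
    rank_loss = -F.logsigmoid(r_sum[chosen_idx] - r_sum[rejected_idx]).mean()
    lengths = lengths.float()
    len_loss = -pearson_corr(r_len, lengths)        # length head follows length
    decor_loss = pearson_corr(r_q, lengths).abs()   # quality head ignores length
    return rank_loss + lam_len * len_loss + lam_decor * decor_loss
```

At RL time only `quality_head` would be queried, so the policy cannot raise its reward simply by producing longer responses.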