InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling

1 Nov 2024 | Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, Dacheng Tao
This paper introduces InfoRM, a novel reward modeling framework that addresses reward hacking in reinforcement learning from human feedback (RLHF). Reward hacking, also known as reward overoptimization, occurs when the policy model's optimization diverges from true human objectives, leading to suboptimal performance. InfoRM tackles this problem by training the reward model with a variational information bottleneck (IB) objective that extracts human-preference-relevant information while filtering out irrelevant and spurious features, thereby improving the reward model's generalizability.
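To make the bottleneck objective concrete, below is a minimal sketch of a variational-IB reward head in PyTorch. This is an illustration under assumed conventions, not the authors' released implementation: the names `IBRewardHead` and `ib_preference_loss`, the latent dimension, and the trade-off weight `beta` are all illustrative choices.

```python
# Minimal sketch of a variational-IB reward head (illustrative, not the
# paper's released code). The reward is computed from a stochastic latent z,
# and a KL penalty to N(0, I) bottlenecks how much of the input survives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBRewardHead(nn.Module):
    """Maps an LM's last hidden state to a stochastic IB latent and a scalar reward."""
    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, latent_dim)      # posterior mean
        self.logvar = nn.Linear(hidden_dim, latent_dim)  # posterior log-variance
        self.reward = nn.Linear(latent_dim, 1)           # reward from latent only

    def forward(self, h: torch.Tensor):
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.reward(z).squeeze(-1), mu, logvar

def ib_preference_loss(head, h_chosen, h_rejected, beta: float = 1e-3):
    """Bradley-Terry preference loss plus a KL-to-N(0, I) bottleneck penalty.

    The KL term upper-bounds the information the latent keeps about the input,
    pressuring it to retain only what helps predict the preference label.
    """
    r_c, mu_c, lv_c = head(h_chosen)
    r_r, mu_r, lv_r = head(h_rejected)
    pref = -F.logsigmoid(r_c - r_r).mean()
    kl = 0.5 * (mu_c.pow(2) + lv_c.exp() - lv_c - 1).sum(-1).mean() \
       + 0.5 * (mu_r.pow(2) + lv_r.exp() - lv_r - 1).sum(-1).mean()
    return pref + beta * kl
```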
Building on this latent space, InfoRM introduces the Cluster Separation Index (CSI), an indicator for detecting reward overoptimization. CSI quantifies deviations in the IB latent space: overoptimized samples appear as outliers that separate from the clusters formed by well-behaved responses (a sketch of this idea appears at the end of this summary).

Extensive experiments across various settings and reward-model scales (70M, 440M, 1.4B, and 7B parameters) demonstrate the effectiveness of InfoRM in mitigating reward overoptimization. The results show that InfoRM both improves the generalizability of reward models and provides a robust tool for detecting reward overoptimization. The CSI metric has been validated across different datasets, confirming its effectiveness at identifying overoptimized samples and its potential to guide the development of online mitigation strategies.

The paper also discusses the broader impacts of the proposed method, highlighting its potential to align large language models more closely with human preferences, and acknowledges the limitations of the current approach, including the need for further research on scaling the framework to larger models and on developing real-time overoptimization detection metrics. Overall, InfoRM represents a significant advance in RLHF by addressing a root cause of reward overoptimization through information-theoretic principles.
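As promised above, here is a rough illustration of the cluster-separation idea behind CSI. It is not the paper's exact definition: it simply clusters IB latents of reference (e.g., SFT-model) responses with scikit-learn's KMeans and scores RLHF-time samples by their normalized distance to the nearest reference cluster, treating unusually distant samples as candidate overoptimized outliers. The function name, cluster count, and threshold are assumptions.

```python
# Rough illustration of a cluster-separation score over IB latents
# (an assumption-laden sketch, not the paper's CSI definition).
import numpy as np
from sklearn.cluster import KMeans

def cluster_separation_score(ref_latents: np.ndarray,
                             rl_latents: np.ndarray,
                             n_clusters: int = 8) -> np.ndarray:
    """Distance of each RLHF-sample latent to its nearest reference cluster,
    normalized by the typical within-cluster spread of the reference data."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(ref_latents)
    # Typical spread: mean distance of reference points to their own center.
    ref_spread = np.linalg.norm(
        ref_latents - km.cluster_centers_[km.labels_], axis=1).mean()
    # Distance of each RLHF latent to the nearest reference cluster center.
    d = np.min(np.linalg.norm(
        rl_latents[:, None, :] - km.cluster_centers_[None, :, :], axis=-1), axis=1)
    return d / (ref_spread + 1e-8)  # scores much greater than 1 suggest outliers

# Example usage: flag samples whose normalized distance exceeds a chosen threshold.
# outliers = cluster_separation_score(sft_z, rlhf_z) > 3.0
```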