InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling

1 Nov 2024 | Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, Dacheng Tao
This paper introduces InfoRM, a novel reward modeling framework that addresses reward hacking in reinforcement learning from human feedback (RLHF). Reward hacking, also known as reward overoptimization, occurs when the policy model's optimization diverges from true human objectives, leading to suboptimal performance. InfoRM tackles this problem by training the reward model with a variational information bottleneck (IB) objective that extracts human-preference-relevant information while filtering out irrelevant and spurious features, thereby improving the reward model's generalizability.
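To make the bottleneck objective concrete, below is a minimal sketch of a variational-IB reward head in PyTorch. This is an illustration under assumed conventions, not the authors' released implementation: the names `IBRewardHead` and `ib_preference_loss`, the latent dimension, and the trade-off weight `beta` are all illustrative choices.

```python
# Minimal sketch of a variational-IB reward head (illustrative, not the
# paper's released code). The reward is computed from a stochastic latent z,
# and a KL penalty to N(0, I) bottlenecks how much of the input survives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBRewardHead(nn.Module):
    """Maps an LM's last hidden state to a stochastic IB latent and a scalar reward."""
    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, latent_dim)      # posterior mean
        self.logvar = nn.Linear(hidden_dim, latent_dim)  # posterior log-variance
        self.reward = nn.Linear(latent_dim, 1)           # reward from latent only

    def forward(self, h: torch.Tensor):
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.reward(z).squeeze(-1), mu, logvar

def ib_preference_loss(head, h_chosen, h_rejected, beta: float = 1e-3):
    """Bradley-Terry preference loss plus a KL-to-N(0, I) bottleneck penalty.

    The KL term upper-bounds the information the latent keeps about the input,
    pressuring it to retain only what helps predict the preference label.
    """
    r_c, mu_c, lv_c = head(h_chosen)
    r_r, mu_r, lv_r = head(h_rejected)
    pref = -F.logsigmoid(r_c - r_r).mean()
    kl = 0.5 * (mu_c.pow(2) + lv_c.exp() - lv_c - 1).sum(-1).mean() \
       + 0.5 * (mu_r.pow(2) + lv_r.exp() - lv_r - 1).sum(-1).mean()
    return pref + beta * kl
```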
Building on this latent space, InfoRM introduces the Cluster Separation Index (CSI), an indicator for detecting reward overoptimization. CSI quantifies deviations in the IB latent space: overoptimized samples appear as outliers that separate from the clusters formed by well-behaved responses (a sketch of this idea appears at the end of this summary).

Extensive experiments across various settings and reward-model scales (70M, 440M, 1.4B, and 7B parameters) demonstrate the effectiveness of InfoRM in mitigating reward overoptimization. The results show that InfoRM both improves the generalizability of reward models and provides a robust tool for detecting reward overoptimization. The CSI metric has been validated across different datasets, confirming its effectiveness at identifying overoptimized samples and its potential to guide the development of online mitigation strategies.

The paper also discusses the broader impacts of the proposed method, highlighting its potential to align large language models more closely with human preferences, and acknowledges the limitations of the current approach, including the need for further research on scaling the framework to larger models and on developing real-time overoptimization detection metrics. Overall, InfoRM represents a significant advance in RLHF by addressing a root cause of reward overoptimization through information-theoretic principles.
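As promised above, here is a rough illustration of the cluster-separation idea behind CSI. It is not the paper's exact definition: it simply clusters IB latents of reference (e.g., SFT-model) responses with scikit-learn's KMeans and scores RLHF-time samples by their normalized distance to the nearest reference cluster, treating unusually distant samples as candidate overoptimized outliers. The function name, cluster count, and threshold are assumptions.

```python
# Rough illustration of a cluster-separation score over IB latents
# (an assumption-laden sketch, not the paper's CSI definition).
import numpy as np
from sklearn.cluster import KMeans

def cluster_separation_score(ref_latents: np.ndarray,
                             rl_latents: np.ndarray,
                             n_clusters: int = 8) -> np.ndarray:
    """Distance of each RLHF-sample latent to its nearest reference cluster,
    normalized by the typical within-cluster spread of the reference data."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(ref_latents)
    # Typical spread: mean distance of reference points to their own center.
    ref_spread = np.linalg.norm(
        ref_latents - km.cluster_centers_[km.labels_], axis=1).mean()
    # Distance of each RLHF latent to the nearest reference cluster center.
    d = np.min(np.linalg.norm(
        rl_latents[:, None, :] - km.cluster_centers_[None, :, :], axis=-1), axis=1)
    return d / (ref_spread + 1e-8)  # scores much greater than 1 suggest outliers

# Example usage: flag samples whose normalized distance exceeds a chosen threshold.
# outliers = cluster_separation_score(sft_z, rlhf_z) > 3.0
```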