This paper introduces the Generalizable Reward Model (GRM), an approach that improves the generalization of reward models for large language models (LLMs) under distribution shifts. The key idea is to regularize the reward model's hidden states with text-generation losses, which preserves the hidden states' text-generation capabilities while improving the reward model's generalization. Concretely, GRM retains the base model's language-model head and applies a suite of text-generation losses to the shared hidden states, while concurrently learning a reward head on top of those same hidden states. The method substantially improves the accuracy of learned reward models across a variety of out-of-distribution (OOD) tasks and alleviates the over-optimization issue in reinforcement learning from human feedback (RLHF), offering a more reliable and robust preference learning paradigm.
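To make the shared-hidden-state idea concrete, the following is a minimal sketch, not the authors' implementation, of how a combined objective of this kind could look in PyTorch: a pairwise (Bradley-Terry) preference loss on the reward head plus an SFT-style next-token loss on the retained language-model head, computed over the same hidden states. The function name grm_loss, the weighting coefficient alpha, and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def grm_loss(chosen_reward, rejected_reward,
             lm_logits_chosen, chosen_labels, alpha=0.01):
    """Sketch of a GRM-style objective (hypothetical helper):
    pairwise reward loss + SFT regularization from the LM head.

    chosen_reward, rejected_reward: scalar rewards per pair, shape (B,)
    lm_logits_chosen: LM-head logits on chosen sequences, shape (B, T, V)
    chosen_labels: token ids of chosen sequences, shape (B, T),
                   with prompt/padding positions set to -100
    alpha: weight of the text-generation regularizer (assumed value)
    """
    # Standard Bradley-Terry reward-modeling loss: prefer chosen over rejected.
    reward_loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()

    # SFT regularization: next-token prediction on the chosen responses,
    # using the base model's language-model head over the same hidden states.
    vocab_size = lm_logits_chosen.size(-1)
    sft_loss = F.cross_entropy(
        lm_logits_chosen[:, :-1, :].reshape(-1, vocab_size),
        chosen_labels[:, 1:].reshape(-1),
        ignore_index=-100,  # mask prompt and padding tokens
    )

    # Total loss: reward learning regularized by text generation.
    return reward_loss + alpha * sft_loss
```

Because the regularizer only reuses the existing language-model head and the hidden states already computed for the reward head, the extra cost is limited to one additional loss term, which is consistent with the paper's description of GRM as lightweight.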
The paper demonstrates that GRM outperforms existing methods on both in-distribution (ID) and out-of-distribution (OOD) tasks. GRM remains robust with limited dataset sizes, mitigates over-optimization under both Best-of-n (BoN) sampling and Proximal Policy Optimization (PPO), and is robust to label noise in the preference dataset. Across the reported benchmarks, GRM delivers consistently strong reward-modeling performance. Among the three types of text-generation regularization studied, SFT regularization is identified as the most effective and stable choice. Overall, GRM offers a lightweight yet effective way to improve the generalization of reward models under distribution shifts, making preference learning for LLMs more reliable and robust.
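For context on the BoN setting referenced above, here is a minimal, library-agnostic sketch of Best-of-n selection with a reward model; generate, reward_model, and n=16 are assumed placeholders rather than a specific API. Over-optimization appears when the proxy reward keeps rising with larger n while true response quality stalls or degrades, which is why a more generalizable reward model matters in this loop.

```python
def best_of_n(prompt, generate, reward_model, n=16):
    """Hypothetical Best-of-n sketch: sample n candidate responses from the
    policy and return the one the reward model scores highest.
    `generate(prompt)` and `reward_model(prompt, response)` are assumed
    callables supplied by the user."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]
```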