This paper introduces the Generalizable Reward Model (GRM), an approach that improves the generalization of reward models for large language models (LLMs) under distribution shifts. The key idea is to regularize the reward model's hidden states with text-generation losses, which preserves the hidden states' text-generation capabilities while improving the reward model's generalization. Concretely, GRM retains the base model's language-model head and applies a suite of text-generation losses to the shared hidden states, while concurrently learning a reward head on top of those same hidden states. The method substantially improves the accuracy of learned reward models across a variety of out-of-distribution (OOD) tasks and alleviates the over-optimization issue in reinforcement learning from human feedback (RLHF), offering a more reliable and robust preference learning paradigm.
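To make the shared-hidden-state idea concrete, the following is a minimal sketch, not the authors' implementation, of how a combined objective of this kind could look in PyTorch: a pairwise (Bradley-Terry) preference loss on the reward head plus an SFT-style next-token loss on the retained language-model head, computed over the same hidden states. The function name grm_loss, the weighting coefficient alpha, and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def grm_loss(chosen_reward, rejected_reward,
             lm_logits_chosen, chosen_labels, alpha=0.01):
    """Sketch of a GRM-style objective (hypothetical helper):
    pairwise reward loss + SFT regularization from the LM head.

    chosen_reward, rejected_reward: scalar rewards per pair, shape (B,)
    lm_logits_chosen: LM-head logits on chosen sequences, shape (B, T, V)
    chosen_labels: token ids of chosen sequences, shape (B, T),
                   with prompt/padding positions set to -100
    alpha: weight of the text-generation regularizer (assumed value)
    """
    # Standard Bradley-Terry reward-modeling loss: prefer chosen over rejected.
    reward_loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()

    # SFT regularization: next-token prediction on the chosen responses,
    # using the base model's language-model head over the same hidden states.
    vocab_size = lm_logits_chosen.size(-1)
    sft_loss = F.cross_entropy(
        lm_logits_chosen[:, :-1, :].reshape(-1, vocab_size),
        chosen_labels[:, 1:].reshape(-1),
        ignore_index=-100,  # mask prompt and padding tokens
    )

    # Total loss: reward learning regularized by text generation.
    return reward_loss + alpha * sft_loss
```

Because the regularizer only reuses the existing language-model head and the hidden states already computed for the reward head, the extra cost is limited to one additional loss term, which is consistent with the paper's description of GRM as lightweight.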
The paper demonstrates that GRM outperforms existing methods on both in-distribution (ID) and out-of-distribution (OOD) tasks. GRM remains robust with limited dataset sizes, mitigates over-optimization under both Best-of-n (BoN) sampling and Proximal Policy Optimization (PPO), and is robust to label noise in the preference dataset. Across the reported benchmarks, GRM delivers consistently strong reward-modeling performance. Among the three types of text-generation regularization studied, SFT regularization is identified as the most effective and stable choice. Overall, GRM offers a lightweight yet effective way to improve the generalization of reward models under distribution shifts, making preference learning for LLMs more reliable and robust.
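For context on the BoN setting referenced above, here is a minimal, library-agnostic sketch of Best-of-n selection with a reward model; generate, reward_model, and n=16 are assumed placeholders rather than a specific API. Over-optimization appears when the proxy reward keeps rising with larger n while true response quality stalls or degrades, which is why a more generalizable reward model matters in this loop.

```python
def best_of_n(prompt, generate, reward_model, n=16):
    """Hypothetical Best-of-n sketch: sample n candidate responses from the
    policy and return the one the reward model scores highest.
    `generate(prompt)` and `reward_model(prompt, response)` are assumed
    callables supplied by the user."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]
```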