Prototypical Reward Network for Data-Efficient RLHF


7 Jul 2024 | Jinghan Zhang, Xiting Wang, Yiqiao Jin, Changyu Chen, Xinhao Zhang, Kunpeng Liu
The paper introduces the Prototypical Reward Model (Proto-RM), a novel framework that enhances the efficiency and accuracy of reward models in Reinforcement Learning from Human Feedback (RLHF). RLHF integrates human feedback with large language models (LLMs) to improve their adaptability and alignment with human preferences, but collecting human feedback is resource-intensive and raises scalability issues, especially for complex tasks. Proto-RM leverages prototypical networks to enable stable and reliable structural learning from limited human feedback samples. By optimizing the embedding process in the reward model, Proto-RM allows the model to learn stable and reliable data representations, enhancing its adaptability and accuracy in interpreting human preferences.

The method consists of three key steps: Sample Encoding and Prototype Initialization, Prototype Update and Addition, and Reward Model Fine-tuning. These steps ensure that the reward model can effectively learn and extract vital parameter information from limited human feedback, guiding the model's behavior to align with human expectations.

Extensive experiments on various datasets demonstrate that Proto-RM significantly improves the performance of reward models and LLMs in human feedback tasks, achieving comparable or better results than traditional methods while requiring significantly less data. The paper also includes a detailed analysis of the method's effectiveness, including comparisons with baseline models, ablation studies, and human evaluations. The results show that Proto-RM consistently outperforms baseline models in accuracy and data efficiency, making it a promising approach for enhancing the efficiency of reward models and optimizing the fine-tuning of language models under restricted feedback conditions.
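To make the idea concrete, here is a minimal sketch (not the authors' code) of how a prototype-based reward head and the standard pairwise preference loss used in RLHF fine-tuning could fit together. It assumes a transformer encoder elsewhere produces sentence embeddings; the class name `ProtoRewardHead`, the number of prototypes, and the random prototype initialization are illustrative assumptions rather than details from the paper.

```python
# Hypothetical sketch of a prototype-based reward head (not the authors' implementation).
# Assumes an encoder elsewhere maps a prompt+response pair to a fixed-size embedding;
# learnable prototype vectors summarize human-preference examples, and the reward is
# derived from the embedding's distances to those prototypes.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ProtoRewardHead(nn.Module):
    """Maps an embedding to a scalar reward via distances to learned prototypes."""

    def __init__(self, embed_dim: int, num_prototypes: int):
        super().__init__()
        # Prototypes are learnable points in embedding space. Random init here;
        # the paper initializes prototypes from encoded human-feedback samples.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, embed_dim))
        self.scorer = nn.Linear(num_prototypes, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, embed_dim). Negative squared distances act as similarity logits.
        dists = torch.cdist(emb, self.prototypes)     # (batch, num_prototypes)
        sims = F.softmax(-dists.pow(2), dim=-1)       # soft assignment to prototypes
        return self.scorer(sims).squeeze(-1)          # (batch,) scalar rewards


def pairwise_preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style reward-model loss on (chosen, rejected) preference pairs."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()


# Usage sketch: fine-tune the reward head on a small batch of preference pairs.
if __name__ == "__main__":
    head = ProtoRewardHead(embed_dim=768, num_prototypes=16)
    opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

    # Placeholder tensors standing in for encoder outputs of chosen/rejected answers.
    emb_chosen, emb_rejected = torch.randn(8, 768), torch.randn(8, 768)

    loss = pairwise_preference_loss(head(emb_chosen), head(emb_rejected))
    loss.backward()
    opt.step()
```

The sketch covers the reward-scoring and fine-tuning steps; the paper's Prototype Update and Addition step, which grows and refines the prototype set as new feedback samples arrive, is omitted here for brevity.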