Prototypical Reward Network for Data-Efficient RLHF

7 Jul 2024 | Jinghan Zhang, Xiting Wang, Yiqiao Jin, Changyu Chen, Xinhao Zhang, Kunpeng Liu
This paper proposes Proto-RM, a prototypical-network-based method for improving reward models in Reinforcement Learning from Human Feedback (RLHF). The method strengthens the reward model's ability to learn from limited human feedback, achieving better performance with significantly less data. Proto-RM leverages prototypical networks to enable stable and reliable structural learning from fewer samples, improving the adaptability and accuracy of large language models (LLMs) in interpreting human preferences. The framework aggregates similar examples in the embedding space into prototypes, which are then used to fine-tune the reward model. This allows the reward model to learn stable and reliable data representations from limited sample sizes, making it particularly suitable for settings with scarce data and complex human preferences.

The method consists of three steps: sample encoding and prototype initialization, prototype update and addition, and reward model fine-tuning. During sample encoding, the reward model encodes the samples, and these encodings are used to initialize the prototypes. The prototypes are then refined according to their distances to the samples, and the reward model's parameters are updated accordingly. Finally, the refined prototypes and encodings are used to train the reward model, which evaluates and guides the outputs of the language model.
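To make the three steps concrete, the following is a minimal PyTorch-style sketch of a prototype-based reward head, not the authors' implementation. The class name ProtoRewardModel, the number of prototypes, the soft assignment of samples to prototypes by Euclidean distance, the distance threshold for adding a prototype, and the pairwise Bradley-Terry preference loss are all illustrative assumptions chosen to mirror the description above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProtoRewardModel(nn.Module):
    """Reward head that scores an encoded response via its similarity to learned prototypes."""

    def __init__(self, hidden_dim: int, num_prototypes: int = 16):
        super().__init__()
        # Prototypes live in the same embedding space as the sample encodings.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, hidden_dim))
        # Linear head mapping a prototype-refined representation to a scalar reward.
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, encodings: torch.Tensor) -> torch.Tensor:
        # encodings: (batch, hidden_dim) pooled sample embeddings from the LM encoder.
        # Soft-assign each sample to prototypes by negative squared distance.
        dists = torch.cdist(encodings, self.prototypes)      # (batch, num_prototypes)
        weights = F.softmax(-dists, dim=-1)                   # closer prototypes weigh more
        # Refine each encoding with its prototype mixture before scoring.
        proto_context = weights @ self.prototypes             # (batch, hidden_dim)
        refined = encodings + proto_context
        return self.reward_head(refined).squeeze(-1)          # scalar reward per sample

    @torch.no_grad()
    def maybe_add_prototype(self, encoding: torch.Tensor, dist_threshold: float = 10.0):
        # Rough stand-in for the prototype-addition step: if a sample is far from every
        # existing prototype, spawn a new prototype at that encoding. The optimizer must
        # be rebuilt afterwards so the new prototype receives gradient updates.
        min_dist = torch.cdist(encoding.unsqueeze(0), self.prototypes).min()
        if min_dist > dist_threshold:
            self.prototypes = nn.Parameter(
                torch.cat([self.prototypes.data, encoding.unsqueeze(0)], dim=0)
            )

def pairwise_preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: the human-preferred response should score higher."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def train_step(model, optimizer, enc_chosen, enc_rejected):
    # One fine-tuning step on a batch of human-preference pairs (pre-computed encodings).
    loss = pairwise_preference_loss(model(enc_chosen), model(enc_rejected))
    optimizer.zero_grad()
    loss.backward()   # updates both the prototypes and the reward head
    optimizer.step()
    return loss.item()

In this sketch the prototypes act as anchors in the embedding space: each sample's reward is computed from a representation pulled toward its nearest prototypes, which is one plausible way to realize the stable, data-efficient structure the paper attributes to prototypical networks.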
Experiments on several datasets show that Proto-RM significantly improves the performance of reward models and LLMs on human feedback tasks, achieving results comparable to or better than traditional methods while requiring substantially less data. The approach is especially effective in data-limited settings, where it matches the effectiveness of training on extensive data despite using only a limited number of samples. These results indicate that Proto-RM is a promising direction for improving the data efficiency of reward models and for fine-tuning language models under restricted feedback conditions.