21 May 2024 | Shun Zhang¹, Zhenfang Chen¹, Sunli Chen², Yikang Shen¹, Zhiqing Sun³, and Chuang Gan¹⁴
This paper proposes efficient reward model ensemble methods to improve the alignment of large language models (LLMs) through Reinforcement Learning from Human Feedback (RLHF). RLHF aligns LLMs with human values by first training a model with supervised fine-tuning (SFT), then training a reward model on human preference data, and finally fine-tuning the SFT model with reinforcement learning against that reward model. However, the reward model may not accurately predict human preferences, which can lead to misaligned outputs. To address this, the authors propose ensemble methods that combine multiple reward models to improve prediction accuracy.
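For context on the reward-modeling step, a reward model is typically trained so that the human-chosen response scores higher than the rejected one. The snippet below is a generic sketch of that pairwise (Bradley-Terry style) loss, assuming scalar rewards from the model; it is not code from the paper, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch:
    # the preferred response should receive a higher scalar reward than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```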
The paper explores two efficient ensemble approaches: a linear-layer ensemble and a LoRA-based ensemble. In the linear-layer ensemble, all reward models share the same Transformer backbone, and each member has only its own linear output layer. In the LoRA-based ensemble, each reward model lightly fine-tunes the shared Transformer through its own LoRA (low-rank adaptation) adapter, which adds only a small number of parameters and can be trained efficiently. Both approaches reduce computational costs while improving reward prediction accuracy.
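For intuition, here is a minimal PyTorch sketch of the linear-layer ensemble: a shared Transformer backbone (assumed to be a Hugging Face-style model that returns `last_hidden_state`) with one scalar reward head per ensemble member. The class and argument names are illustrative, not the authors' implementation. A LoRA-based ensemble follows the same pattern, except each member additionally attaches its own low-rank adapter to the otherwise shared backbone.

```python
import torch
import torch.nn as nn

class RewardHeadEnsemble(nn.Module):
    """Shared Transformer backbone with one scalar reward head per ensemble member."""

    def __init__(self, backbone: nn.Module, hidden_size: int, num_members: int = 4):
        super().__init__()
        self.backbone = backbone  # shared Transformer encoder/decoder
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, 1) for _ in range(num_members)]
        )

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # Use the hidden state of the last non-padding token to summarize the response.
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        h = out.last_hidden_state[batch_idx, last_idx]
        # One reward prediction per ensemble member: shape (batch, num_members).
        return torch.cat([head(h) for head in self.heads], dim=-1)
```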
The authors evaluate their methods on the AlpacaEval and MT-Bench benchmarks. They find that both the linear-layer ensemble and the LoRA-based ensemble improve alignment performance compared to using a single reward model. With PPO, the LoRA-based ensemble performs best, while with Best-of-n sampling both ensemble variants perform strongly. These results show that even though LoRA does not fully fine-tune the Transformer model, it is effective for reward model ensembling and improves alignment performance.
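As a rough illustration of how an ensemble plugs into Best-of-n sampling: n candidate responses are sampled from the policy, every ensemble member scores each candidate, and the candidate with the best aggregated reward is returned. The shapes and names below are assumptions for the sketch, not the paper's implementation.

```python
import torch

def best_of_n(candidates: list[str], member_rewards: torch.Tensor) -> str:
    # member_rewards: (num_members, num_candidates), one row of scores per ensemble member.
    aggregated = member_rewards.mean(dim=0)             # average over ensemble members
    return candidates[int(aggregated.argmax().item())]  # highest-scoring candidate
```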
The paper also discusses the importance of uncertainty estimation in reward model ensembles and how conservative predictions can help mitigate reward overoptimization. The authors conclude that their efficient ensemble methods are effective in improving the alignment of LLMs under computational constraints. Future work will extend these methods to other steps of LLM training and inference.
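To illustrate the kind of conservative prediction discussed here, one common aggregation is a lower-confidence-bound style score: the mean reward across members minus a multiple of their standard deviation. This is a minimal sketch under that assumption, with an illustrative penalty weight `k`, and is not necessarily the exact rule used in the paper.

```python
import torch

def conservative_reward(member_rewards: torch.Tensor, k: float = 1.0) -> torch.Tensor:
    # member_rewards: (num_members, batch) reward predictions from each ensemble member.
    # Penalizing disagreement (high std) discourages the policy from exploiting
    # responses the ensemble is uncertain about, mitigating reward overoptimization.
    return member_rewards.mean(dim=0) - k * member_rewards.std(dim=0)
```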