21 May 2024 | Shun Zhang¹, Zhenfang Chen¹, Sunli Chen², Yikang Shen¹, Zhiqing Sun³, and Chuang Gan¹⁴
This paper proposes efficient reward model ensemble methods to improve the alignment of large language models (LLMs) through Reinforcement Learning from Human Feedback (RLHF). RLHF aligns LLMs with human values by first training a model with supervised fine-tuning (SFT), then training a reward model on human preference data, and finally fine-tuning the SFT model with reinforcement learning against that reward model. However, the reward model may not accurately predict human preferences, which can lead to misaligned outputs. To address this, the authors propose ensemble methods that combine multiple reward models to improve prediction accuracy.
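For context on the reward-modeling step, a reward model is typically trained so that the human-chosen response scores higher than the rejected one. The snippet below is a generic sketch of that pairwise (Bradley-Terry style) loss, assuming scalar rewards from the model; it is not code from the paper, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch:
    # the preferred response should receive a higher scalar reward than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```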
The paper explores two efficient ensemble approaches: a linear-layer ensemble and a LoRA-based ensemble. In the linear-layer ensemble, all reward models share the same Transformer backbone, and each member has only its own linear output layer. In the LoRA-based ensemble, each reward model lightly fine-tunes the shared Transformer through its own LoRA (low-rank adaptation) adapter, which adds only a small number of parameters and can be trained efficiently. Both approaches reduce computational costs while improving reward prediction accuracy.
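For intuition, here is a minimal PyTorch sketch of the linear-layer ensemble: a shared Transformer backbone (assumed to be a Hugging Face-style model that returns `last_hidden_state`) with one scalar reward head per ensemble member. The class and argument names are illustrative, not the authors' implementation. A LoRA-based ensemble follows the same pattern, except each member additionally attaches its own low-rank adapter to the otherwise shared backbone.

```python
import torch
import torch.nn as nn

class RewardHeadEnsemble(nn.Module):
    """Shared Transformer backbone with one scalar reward head per ensemble member."""

    def __init__(self, backbone: nn.Module, hidden_size: int, num_members: int = 4):
        super().__init__()
        self.backbone = backbone  # shared Transformer encoder/decoder
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, 1) for _ in range(num_members)]
        )

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # Use the hidden state of the last non-padding token to summarize the response.
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        h = out.last_hidden_state[batch_idx, last_idx]
        # One reward prediction per ensemble member: shape (batch, num_members).
        return torch.cat([head(h) for head in self.heads], dim=-1)
```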
The authors evaluate their methods on the AlpacaEval and MT-Bench benchmarks. They find that both the linear-layer ensemble and the LoRA-based ensemble improve alignment performance compared to using a single reward model. With PPO, the LoRA-based ensemble performs best, while with Best-of-n sampling both ensemble variants perform strongly. These results show that even though LoRA does not fully fine-tune the Transformer model, it is effective for reward model ensembling and improves alignment performance.
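As a rough illustration of how an ensemble plugs into Best-of-n sampling: n candidate responses are sampled from the policy, every ensemble member scores each candidate, and the candidate with the best aggregated reward is returned. The shapes and names below are assumptions for the sketch, not the paper's implementation.

```python
import torch

def best_of_n(candidates: list[str], member_rewards: torch.Tensor) -> str:
    # member_rewards: (num_members, num_candidates), one row of scores per ensemble member.
    aggregated = member_rewards.mean(dim=0)             # average over ensemble members
    return candidates[int(aggregated.argmax().item())]  # highest-scoring candidate
```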
The paper also discusses the importance of uncertainty estimation in reward model ensembles and how conservative predictions can help mitigate reward overoptimization. The authors conclude that their efficient ensemble methods are effective in improving the alignment of LLMs under computational constraints. Future work will extend these methods to other steps of LLM training and inference.
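To illustrate the kind of conservative prediction discussed here, one common aggregation is a lower-confidence-bound style score: the mean reward across members minus a multiple of their standard deviation. This is a minimal sketch under that assumption, with an illustrative penalty weight `k`, and is not necessarily the exact rule used in the paper.

```python
import torch

def conservative_reward(member_rewards: torch.Tensor, k: float = 1.0) -> torch.Tensor:
    # member_rewards: (num_members, batch) reward predictions from each ensemble member.
    # Penalizing disagreement (high std) discourages the policy from exploiting
    # responses the ensemble is uncertain about, mitigating reward overoptimization.
    return member_rewards.mean(dim=0) - k * member_rewards.std(dim=0)
```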