Bayesian Reward Models for LLM Alignment

2024 | Adam X. Yang, Maxime Robeyns, Thomas Coste, Zhengyan Shi, Jun Wang, Haitham Bou Ammar, Laurence Aitchison
The paper addresses reward overoptimization in large language models (LLMs): responses receiving high scores from a learned reward model because of the model's imperfections rather than true human preference. To mitigate this, the authors propose a Bayesian reward model trained with a Laplace approximation over Low-Rank Adaptation (LoRA) weights, and they find that the resulting uncertainty estimates effectively reduce reward overoptimization in best-of-$n$ (BoN) sampling. The paper first motivates the problem: aligning LLMs with human preferences is essential for safety and helpfulness, yet reward models are prone to overoptimization, particularly in out-of-distribution (OOD) regions, as illustrated with examples from reinforcement learning from human feedback (RLHF). The authors then introduce Bayesian deep learning as a way to address distribution shift and overconfidence in deep neural networks, focusing on Laplace-LoRA, a scalable Bayesian approximation technique.

The method integrates uncertainty quantification into reward modeling through Laplace-LoRA, which yields a Gaussian distribution over reward outputs and hence an estimate of epistemic uncertainty. The authors propose two uncertainty penalties, one based on the standard deviation and one on the variance, to reduce the impact of overconfident predictions; the penalized scores are written out below. Applied to both single reward models and reward-model ensembles, the penalties are shown to mitigate reward overoptimization.
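To make the two penalties concrete (the notation here is illustrative, not necessarily the paper's): Laplace-LoRA gives a Gaussian over the scalar reward for a prompt $x$ and response $y$, with mean $\mu(x, y)$ and variance $\sigma^2(x, y)$. Candidates are then scored with either

$$r_{\text{std}}(x, y) = \mu(x, y) - k\,\sigma(x, y) \qquad \text{or} \qquad r_{\text{var}}(x, y) = \mu(x, y) - k\,\sigma^2(x, y),$$

where $k > 0$ is a penalty coefficient trading off expected reward against epistemic uncertainty, so that responses whose high proxy reward comes with high posterior uncertainty are down-weighted.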
The experimental setup follows a synthetic-labeling strategy: an oracle gold reward model provides labels for training proxy reward models. The results show that the proposed method significantly improves BoN sampling and RLHF, achieving higher gold-reward scores without the need for KL penalties. In conclusion, the paper demonstrates that quantifying reward-model uncertainty with Laplace-LoRA effectively mitigates reward overoptimization and yields gains over standard methods, highlighting the potential of Bayesian approaches for reliable and safer alignment of LLMs.
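A minimal sketch of how the uncertainty-penalized BoN selection described above could look in code; `reward_mean_and_var`, `generate`, and the penalty coefficient `k` are placeholders for illustration, not the paper's implementation.

```python
import math

def penalized_best_of_n(prompt, candidates, reward_mean_and_var, k=1.0, penalty="std"):
    """Return the candidate with the highest uncertainty-penalized proxy reward.

    `reward_mean_and_var` is a hypothetical callable standing in for a
    Laplace-LoRA reward model: given (prompt, response), it returns the mean
    and variance of the Gaussian over the scalar reward.
    """
    best_response, best_score = None, -math.inf
    for response in candidates:
        mu, var = reward_mean_and_var(prompt, response)
        sigma = math.sqrt(var)
        # Standard-deviation-based or variance-based penalty on the proxy reward.
        score = mu - k * (sigma if penalty == "std" else var)
        if score > best_score:
            best_response, best_score = response, score
    return best_response

# Hypothetical usage: `generate` stands in for the policy LLM's sampler and
# `reward_model` for the Bayesian (Laplace-LoRA) reward model.
# candidates = [generate(prompt) for _ in range(16)]
# best = penalized_best_of_n(prompt, candidates, reward_model, k=2.0, penalty="var")
```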