Bayesian Reward Models for LLM Alignment

2024 | Adam X. Yang, Maxime Robeyns, Thomas Coste, Zhengyan Shi, Jun Wang, Haitham Bou Ammar, Laurence Aitchison
This paper proposes a Bayesian reward model to address reward overoptimization in large language model (LLM) alignment. Reward overoptimization occurs when responses receive high rewards because of imperfections in the reward model rather than genuine human preference, especially when prompts or responses deviate from the training data. To mitigate this, the authors introduce a Bayesian reward model that signals higher uncertainty the further an input lies from the training distribution. They train these models by applying a Laplace approximation to LoRA weights and find that the resulting uncertainty estimates effectively reduce reward overoptimization in best-of-n (BoN) sampling.

Because reward models are trained on finite preference data, optimizing against them with BoN sampling or reinforcement learning from human feedback (RLHF) can exploit their imperfections rather than improve true quality. The authors therefore integrate uncertainty quantification through Laplace-LoRA, which provides a Gaussian distribution over the reward for each test prompt and response pair.
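The paper's Laplace-LoRA construction operates on the LoRA weights of a full LLM reward model; as a much smaller illustration of the kind of output it produces, the sketch below runs a diagonal, last-layer Laplace approximation on a toy linear reward head. Every name here (`phi_train`, `w_map`, `prior_precision`, `reward_distribution`) and the diagonal curvature approximation are illustrative assumptions, not the authors' implementation; the point is only that each (prompt, response) pair gets a Gaussian mean and variance rather than a single reward score.

```python
import torch

# Toy last-layer Laplace approximation for a scalar reward head r(x, y) = w^T phi(x, y).
# The paper applies the Laplace approximation to LoRA weights inside an LLM; this
# diagonal linear-head version only illustrates the resulting predictive Gaussian.

torch.manual_seed(0)
d, n_train = 16, 256

phi_train = torch.randn(n_train, d)   # stand-in for features of training pairs
w_map = torch.randn(d)                # stand-in for the MAP-trained head weights
prior_precision = 1.0                 # isotropic Gaussian prior on w

# Diagonal curvature term: for a linear head, d r / d w = phi, so a Fisher-style
# approximation of the loss Hessian reduces to summed squared features. (The real
# Bradley-Terry curvature would also weight each pair by a sigmoid derivative.)
posterior_precision = prior_precision + (phi_train ** 2).sum(dim=0)
posterior_var = 1.0 / posterior_precision            # diagonal posterior covariance

def reward_distribution(phi_test: torch.Tensor):
    """Gaussian over the reward for each test (prompt, response) feature vector."""
    mean = phi_test @ w_map                           # MAP reward
    var = (phi_test ** 2 * posterior_var).sum(-1)     # phi^T Sigma phi, diagonal Sigma
    return mean, var

mu, var = reward_distribution(torch.randn(8, d))
print(mu, var.sqrt())                                 # reward mean and std per pair
```

Feature vectors far from the training data pick up larger predictive variance, which is the signal the uncertainty penalties described below consume.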
These uncertainty estimates let the reward model flag predictions it is unsure about, helping it avoid overconfident rewards, especially in out-of-distribution (OOD) scenarios. The authors explore two ways of incorporating uncertainty penalties into reward estimation: a standard-deviation-based penalty and a variance-based penalty. Both reduce the reward assigned to responses with higher uncertainty, promoting more conservative reward allocation. The approach can also be combined with reward ensembles, in which multiple reward models are trained independently and their outputs aggregated into a more robust optimization target.

Experiments show that Laplace-LoRA significantly improves reward-model performance in BoN sampling when evaluated against a gold-standard reward model, indicating that the uncertainty estimates help the proxy reward reflect true preferences more accurately, particularly in OOD scenarios. The approach also performs well when combined with reward ensembles. The paper concludes that Bayesian approaches such as Laplace-LoRA offer a promising way to mitigate reward overoptimization, providing more reliable and safer LLM alignment through uncertainty estimation.
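A minimal sketch of how the standard-deviation and variance penalties described above could make BoN sampling conservative. The coefficient `k` and the function names are assumptions for illustration; the paper's exact penalty weighting is not reproduced here.

```python
import torch

def penalized_reward(mean: torch.Tensor, var: torch.Tensor,
                     k: float = 1.0, penalty: str = "std") -> torch.Tensor:
    """Conservative reward: posterior mean minus an uncertainty penalty."""
    if penalty == "std":
        return mean - k * var.sqrt()   # mu - k * sigma
    if penalty == "var":
        return mean - k * var          # mu - k * sigma^2
    raise ValueError(f"unknown penalty: {penalty}")

def best_of_n(mean: torch.Tensor, var: torch.Tensor,
              k: float = 1.0, penalty: str = "std") -> int:
    """Index of the candidate response with the highest penalized reward."""
    return int(penalized_reward(mean, var, k, penalty).argmax())

# Usage: per-candidate mean/var would come from the Bayesian reward model
# (e.g. the sketch above); here they are random stand-ins for 16 candidates.
mean, var = torch.randn(16), torch.rand(16)
print(best_of_n(mean, var, k=2.0, penalty="std"))
```

Candidates whose high mean reward comes with high uncertainty lose out to slightly lower-scoring but better-supported responses, which is the intended conservative behaviour.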
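For the ensemble variant, one common conservative aggregation is to score each candidate by the ensemble mean minus a multiple of the members' disagreement. The paper combines ensembles with Laplace-LoRA uncertainty; the particular mean-minus-std rule below is an assumption used only to illustrate how independently trained reward models might be aggregated.

```python
import torch

def ensemble_conservative_reward(member_rewards: torch.Tensor, k: float = 1.0) -> torch.Tensor:
    """Aggregate independently trained reward models into one conservative target.

    member_rewards: shape (num_members, num_candidates); disagreement across
    members (their std) acts as an uncertainty proxy and is subtracted.
    """
    return member_rewards.mean(dim=0) - k * member_rewards.std(dim=0)

# Usage: 4 ensemble members each scoring 16 candidate responses for one prompt.
scores = ensemble_conservative_reward(torch.randn(4, 16), k=1.0)
print(int(scores.argmax()))   # index of the selected response
```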