9 Jul 2024 | Xiaoying Zhang, Jean-François Ton, Wei Shen, Hongning Wang, Yang Liu
This paper introduces ADVPO, a method for addressing reward overoptimization in Reinforcement Learning from Human Feedback (RLHF). The core idea is a lightweight uncertainty estimation technique that quantifies reward uncertainty using only the last-layer embeddings of the reward model, enabling efficient mitigation of overoptimization without computationally expensive reward-model ensembles. Building on this uncertainty estimate, ADVPO is a distributionally robust optimization procedure that, in contrast to previous sample-wise uncertainty-penalization methods, handles reward uncertainty in a less pessimistic manner. Extensive experiments on the Anthropic HH and TL;DR summarization datasets show that ADVPO mitigates overoptimization and improves policy performance, with better alignment to human preferences confirmed through human-assisted evaluation. Because the uncertainty estimate is computed from existing reward models, the approach is broadly applicable. The paper also discusses the limitations of current approaches to reward overoptimization and highlights the proposed method's advantages in computational efficiency and effectiveness.
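To make the "last-layer embedding" idea concrete, here is a minimal sketch of one common lightweight uncertainty estimate: treat the reward head as a linear model over the frozen penultimate features and measure how far a new embedding lies from the training data via the inverse Gram matrix. The function names and the exact formula are illustrative assumptions, not the paper's implementation, and the embeddings here are random stand-ins for features a trained reward model would produce.

```python
import numpy as np

def fit_embedding_covariance(train_embeddings, ridge_lambda=1.0):
    """Accumulate the regularized Gram matrix of last-layer reward-model
    embeddings from the preference training data and return its inverse.

    train_embeddings: array of shape (n_samples, d) of last-layer features.
    """
    d = train_embeddings.shape[1]
    gram = ridge_lambda * np.eye(d)
    for phi in train_embeddings:
        gram += np.outer(phi, phi)
    return np.linalg.inv(gram)

def reward_uncertainty(phi, gram_inv):
    """Uncertainty of the predicted reward for one response embedding phi:
    the norm of phi under the inverse Gram matrix, which grows when phi
    lies outside the directions covered by the training data."""
    return float(np.sqrt(phi @ gram_inv @ phi))

# Hypothetical usage with random stand-in embeddings.
rng = np.random.default_rng(0)
train_phis = rng.normal(size=(1000, 64))       # would come from the reward model
gram_inv = fit_embedding_covariance(train_phis)

in_dist = rng.normal(size=64)                  # resembles training data
out_dist = 10.0 * rng.normal(size=64)          # far from training data
print(reward_uncertainty(in_dist, gram_inv))   # relatively small
print(reward_uncertainty(out_dist, gram_inv))  # relatively large
```

A sample-wise penalization baseline would simply subtract a multiple of this per-response uncertainty from the reward during RL; the summary notes that ADVPO instead uses a distributionally robust objective that is less pessimistic than such per-sample penalties, though its exact form is not detailed here.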