OVERCOMING REWARD OVEROPTIMIZATION VIA ADVERSARIAL POLICY OPTIMIZATION WITH LIGHTWEIGHT UNCERTAINTY ESTIMATION

9 Jul 2024 | Xiaoying Zhang, Jean-Francois Ton, Wei Shen, Hongning Wang, Yang Liu
The paper "Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation" addresses the issue of reward overoptimization in Reinforcement Learning from Human Feedback (RLHF). The authors propose a lightweight uncertainty quantification method that uses only the last layer embeddings of the reward model to assess the reliability of the proxy reward. This method is integrated into the RLHF pipeline to provide efficient uncertainty estimates. Building on these estimates, they introduce AdvPO (Adversarial Policy Optimization), a distributionally robust optimization procedure that aims to mitigate overoptimization during policy improvement. AdvPO contrasts with previous sample-wise uncertainty penalization methods by handling reward uncertainty in a less pessimistic manner, leading to enhanced policy performance. Extensive experiments on the Anthropic HH and TL:DR summarization datasets demonstrate the effectiveness of AdvPO in reducing overoptimization and improving policy performance, as evaluated through human-assisted assessments. The contributions of the paper include a lightweight uncertainty estimation method, the AdvPO framework, and empirical validation of its effectiveness in real-world scenarios.The paper "Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation" addresses the issue of reward overoptimization in Reinforcement Learning from Human Feedback (RLHF). The authors propose a lightweight uncertainty quantification method that uses only the last layer embeddings of the reward model to assess the reliability of the proxy reward. This method is integrated into the RLHF pipeline to provide efficient uncertainty estimates. Building on these estimates, they introduce AdvPO (Adversarial Policy Optimization), a distributionally robust optimization procedure that aims to mitigate overoptimization during policy improvement. AdvPO contrasts with previous sample-wise uncertainty penalization methods by handling reward uncertainty in a less pessimistic manner, leading to enhanced policy performance. Extensive experiments on the Anthropic HH and TL:DR summarization datasets demonstrate the effectiveness of AdvPO in reducing overoptimization and improving policy performance, as evaluated through human-assisted assessments. The contributions of the paper include a lightweight uncertainty estimation method, the AdvPO framework, and empirical validation of its effectiveness in real-world scenarios.