5 Mar 2024 | Cassidy Laidlaw, Shivam Singhal, Anca Dragan
This paper introduces a method to prevent reward hacking in reinforcement learning by regularizing the occupancy measure (OM) divergence between policies rather than the action distribution (AD) divergence. Reward hacking occurs when an agent performs well on a proxy reward function but poorly on the true reward. The authors argue that AD regularization is insufficient: small changes in a policy's action distribution can compound into large differences in outcomes, while large changes may be harmless.

Instead, they propose regularizing the OM divergence, which captures the distribution of states a policy visits. They show theoretically that this is more effective at preventing large drops in true reward, and empirically that OM divergence outperforms AD divergence at preventing reward hacking across several environments. They introduce Occupancy-Regularized Policy Optimization (ORPO), an algorithm that can be easily incorporated into deep RL methods like PPO; ORPO approximates the OM divergence between policies using a discriminator network. In their experiments, training with OM regularization yields better performance under the true reward function in every environment, and OM regularization can also be used to steer already-learned policies away from reward-hacking behavior. The paper concludes that OM regularization is a more effective method for preventing reward hacking than AD regularization.
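The intuition that a small per-state action change can produce a large occupancy shift can be illustrated with a toy MDP (my own construction, not from the paper): an agent walks a chain of "good" states, but at every step a perturbed policy falls into an absorbing "bad" state with small probability ε. The per-state AD divergence is just ε, yet the perturbation compounds over the horizon into a large OM divergence.

```python
# Toy illustration (not from the paper): a tiny action-distribution (AD)
# change at every state compounds into a large occupancy-measure (OM) shift.

def occupancy_tv(eps: float, horizon: int) -> float:
    """TV distance between the state-occupancy measures of a 'safe' policy
    (always stays on the good chain) and a perturbed policy that falls into
    an absorbing 'bad' state with probability eps at each step."""
    # Safe policy: uniform occupancy over good_0 .. good_{T-1}.
    # Perturbed policy: mass (1 - eps)^t on good_t, remainder on 'bad'.
    good_mass = [(1 - eps) ** t / horizon for t in range(horizon)]
    bad_mass = 1.0 - sum(good_mass)
    # TV = 1/2 * sum of absolute differences (the safe policy puts 0 on 'bad').
    return 0.5 * (sum(1.0 / horizon - m for m in good_mass) + bad_mass)

eps, T = 0.05, 50
ad_tv = eps                      # per-state action-distribution TV is just eps
om_tv = occupancy_tv(eps, T)
print(f"AD divergence (per-state TV): {ad_tv:.3f}")   # 0.050
print(f"OM divergence (state TV):     {om_tv:.3f}")   # ~0.63, over 12x larger
```

So a constraint on AD divergence would see these two policies as nearly identical, while an OM constraint correctly flags that the perturbed policy spends most of its time in states the safe policy never reaches.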
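The summary mentions that ORPO approximates OM divergence with a discriminator network. A minimal sketch of that idea (my own simplification, not the authors' implementation) uses the standard density-ratio trick: a classifier trained to distinguish states visited by the current policy from states visited by a safe policy has a logit that estimates log(μ_current(s)/μ_safe(s)), so averaging it over current-policy states estimates the OM KL divergence. The state distributions and hyperparameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D state samples: states visited by a 'safe' policy vs. the
# current policy being optimized (distributions chosen for the example).
safe_states = rng.normal(0.0, 1.0, size=(2000, 1))
curr_states = rng.normal(1.5, 1.0, size=(2000, 1))

# Logistic-regression discriminator d(s) trained to output 1 on current-policy
# states and 0 on safe-policy states; its logit estimates the log density
# ratio log(mu_curr(s) / mu_safe(s)).
X = np.vstack([curr_states, safe_states])
y = np.concatenate([np.ones(len(curr_states)), np.zeros(len(safe_states))])
w, b = np.zeros(1), 0.0
for _ in range(500):                       # plain gradient descent on log-loss
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y) / len(y))
    b -= 0.5 * np.mean(p - y)

# KL(mu_curr || mu_safe) ~= E_{s ~ curr}[logit(s)]. An ORPO-style update would
# subtract a coefficient times this per-state logit from the proxy reward.
kl_estimate = float(np.mean(curr_states @ w + b))
print(f"estimated OM KL: {kl_estimate:.2f}")
```

For these two unit-variance Gaussians the true KL is (1.5²)/2 ≈ 1.125, and the discriminator-based estimate lands close to that, which is what makes the logit a usable per-state regularization signal during policy optimization.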