This paper investigates conditions under which modifications to the reward function of a Markov decision process (MDP) preserve the optimal policy. It shows that, besides the positive linear transformation familiar from utility theory, one can add a reward for transitions between states that is expressible as the difference in value of an arbitrary potential function applied to those states. Furthermore, this is shown to be a necessary condition for invariance, in the sense that any other transformation may yield suboptimal policies unless further assumptions are made about the underlying MDP. These results shed light on the practice of reward shaping, a method used in reinforcement learning whereby additional training rewards are supplied to guide the learning agent. The paper demonstrates that potential-based shaping rewards can lead to substantial reductions in learning time.
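In standard notation (reconstructed here rather than quoted from the paper), with Φ an arbitrary real-valued potential function over states and γ the MDP's discount factor, the admissible shaping reward and its effect on the optimal Q-function are:

```latex
\[
  F(s, a, s') = \gamma\,\Phi(s') - \Phi(s),
  \qquad
  \tilde{Q}^*(s, a) = Q^*(s, a) - \Phi(s).
\]
```

Because the shift −Φ(s) does not depend on the action, greedy action selection in every state, and hence the optimal policy, is unchanged in the shaped MDP.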
The paper presents reward shaping as a potentially powerful technique for scaling reinforcement learning methods up to complex problems, and shows that shaping rewards must satisfy certain conditions to avoid misleading the agent into learning suboptimal policies. It introduces a formal framework for shaping rewards and shows that potential-based shaping functions are necessary and sufficient for policy invariance under reward transformations. It also provides methods for constructing shaping potentials corresponding to distance-based and subgoal-based heuristics.
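As an illustration of the distance-based construction, here is a minimal sketch of tabular Q-learning with a potential-based shaping reward on a toy grid world; the environment, constants, and names (GOAL, grid_step, phi) are illustrative assumptions, not code from the paper:

```python
# Minimal sketch: potential-based shaping with a distance-based heuristic
# on a deterministic 5x5 grid world (illustrative, not from the paper).
import random
from collections import defaultdict

GAMMA = 0.99
ALPHA = 0.1
GOAL = (4, 4)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up

def phi(state):
    # Distance-based potential: zero at the goal, more negative farther away,
    # so F = gamma*phi(s') - phi(s) rewards progress toward the goal.
    return -(abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1]))

def grid_step(state, action):
    # Deterministic dynamics; environment reward 1 only on reaching the goal.
    x = min(max(state[0] + action[0], 0), 4)
    y = min(max(state[1] + action[1], 0), 4)
    next_state = (x, y)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

Q = defaultdict(float)

def q_learning_episode(eps=0.1):
    state = (0, 0)
    done = False
    while not done:
        if random.random() < eps:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = grid_step(state, action)
        # Potential-based shaping term, added only during learning; by the
        # paper's theorem it cannot change which policy is optimal.
        shaped = reward + GAMMA * phi(next_state) - phi(state)
        target = shaped if done else shaped + GAMMA * max(
            Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = next_state

for _ in range(200):
    q_learning_episode()
```

Because phi is zero at the goal and decreases with distance from it, the shaping term pays the agent for each step of progress, densifying the otherwise sparse reward while leaving the optimal policy intact.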
The paper presents experiments demonstrating the effectiveness of potential-based shaping in speeding up learning on simple domains. It also discusses the connection between potential-based shaping and existing algorithms such as Advantage learning and λ-policy iteration. The paper concludes that potential-based shaping is a robust and effective method for guiding learning in reinforcement learning tasks.
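The connection to Advantage learning can be made concrete with a short derivation (reconstructed here, not quoted from the paper): choosing the potential Φ = V*, the optimal value function, turns the shaped optimal Q-values into the advantages of the original MDP:

```latex
\[
  \tilde{Q}^*(s, a) = Q^*(s, a) - V^*(s) = A^*(s, a) \le 0,
\]
```

with equality exactly at the optimal actions. A potential close to V* therefore leaves the learner with near-zero values to estimate everywhere except where the choice of action matters, which is one way to understand why good shaping potentials reduce learning time.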