This paper investigates conditions under which modifications to the reward function of a Markov decision process (MDP) preserve the optimal policy. It shows that, besides the positive linear transformation familiar from utility theory, one can add a reward for transitions between states that is expressible as the difference in value of an arbitrary potential function applied to those states. Furthermore, this is shown to be a necessary condition for invariance, in the sense that any other transformation may yield suboptimal policies unless further assumptions are made about the underlying MDP. These results shed light on the practice of reward shaping, a method used in reinforcement learning whereby additional training rewards are supplied to guide the learning agent. The paper demonstrates that potential-based shaping rewards can lead to substantial reductions in learning time.
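In standard notation (reconstructed here rather than quoted from the paper), with Φ an arbitrary real-valued potential function over states and γ the MDP's discount factor, the admissible shaping reward and its effect on the optimal Q-function are:

```latex
\[
  F(s, a, s') = \gamma\,\Phi(s') - \Phi(s),
  \qquad
  \tilde{Q}^*(s, a) = Q^*(s, a) - \Phi(s).
\]
```

Because the shift −Φ(s) does not depend on the action, greedy action selection in every state, and hence the optimal policy, is unchanged in the shaped MDP.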
The paper presents reward shaping as a potentially powerful technique for scaling reinforcement learning methods up to complex problems, and shows that shaping rewards must satisfy certain conditions to avoid misleading the agent into learning suboptimal policies. It introduces a formal framework for shaping rewards and shows that potential-based shaping functions are necessary and sufficient for policy invariance under reward transformations. It also provides methods for constructing shaping potentials corresponding to distance-based and subgoal-based heuristics.
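As an illustration of the distance-based construction, here is a minimal sketch of tabular Q-learning with a potential-based shaping reward on a toy grid world; the environment, constants, and names (GOAL, grid_step, phi) are illustrative assumptions, not code from the paper:

```python
# Minimal sketch: potential-based shaping with a distance-based heuristic
# on a deterministic 5x5 grid world (illustrative, not from the paper).
import random
from collections import defaultdict

GAMMA = 0.99
ALPHA = 0.1
GOAL = (4, 4)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up

def phi(state):
    # Distance-based potential: zero at the goal, more negative farther away,
    # so F = gamma*phi(s') - phi(s) rewards progress toward the goal.
    return -(abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1]))

def grid_step(state, action):
    # Deterministic dynamics; environment reward 1 only on reaching the goal.
    x = min(max(state[0] + action[0], 0), 4)
    y = min(max(state[1] + action[1], 0), 4)
    next_state = (x, y)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

Q = defaultdict(float)

def q_learning_episode(eps=0.1):
    state = (0, 0)
    done = False
    while not done:
        if random.random() < eps:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = grid_step(state, action)
        # Potential-based shaping term, added only during learning; by the
        # paper's theorem it cannot change which policy is optimal.
        shaped = reward + GAMMA * phi(next_state) - phi(state)
        target = shaped if done else shaped + GAMMA * max(
            Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = next_state

for _ in range(200):
    q_learning_episode()
```

Because phi is zero at the goal and decreases with distance from it, the shaping term pays the agent for each step of progress, densifying the otherwise sparse reward while leaving the optimal policy intact.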
The paper presents experiments demonstrating the effectiveness of potential-based shaping in speeding up learning on simple domains. It also discusses the connection between potential-based shaping and existing algorithms such as Advantage learning and λ-policy iteration. The paper concludes that potential-based shaping is a robust and effective method for guiding learning in reinforcement learning tasks.
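The connection to Advantage learning can be made concrete with a short derivation (reconstructed here, not quoted from the paper): choosing the potential Φ = V*, the optimal value function, turns the shaped optimal Q-values into the advantages of the original MDP:

```latex
\[
  \tilde{Q}^*(s, a) = Q^*(s, a) - V^*(s) = A^*(s, a) \le 0,
\]
```

with equality exactly at the optimal actions. A potential close to V* therefore leaves the learner with near-zero values to estimate everywhere except where the choice of action matters, which is one way to understand why good shaping potentials reduce learning time.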