Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition


Thomas G. Dietterich | Submitted 11/99; published 11/00
This paper introduces a new approach to hierarchical reinforcement learning, known as the MAXQ decomposition. The MAXQ decomposition breaks the target Markov Decision Process (MDP) into a hierarchy of smaller MDPs and decomposes the value function of the target MDP into an additive combination of the value functions of these smaller MDPs. The decomposition has both procedural and declarative semantics: it represents a hierarchical policy while also allowing policies and value functions to be shared across subtasks. The paper discusses the advantages of this approach, including the ability to exploit state abstractions and the potential for faster learning. It presents an online, model-free learning algorithm, MAXQ-Q, which converges to a recursively optimal policy even in the presence of state abstractions. The MAXQ representation and MAXQ-Q are evaluated experimentally in three domains, where they converge faster than flat Q learning. The paper also addresses design trade-offs and provides a formal framework for the MAXQ method, including the definition of subtasks, state abstractions, and non-hierarchical execution.
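To make the additive decomposition concrete, the sketch below illustrates the idea that the value of invoking a child subtask a within a parent task i splits into the child's own value plus a "completion" term, Q(i, s, a) = V(a, s) + C(i, s, a), and shows one possible form of the MAXQ-Q update. This is a minimal illustration, not the paper's implementation: the hierarchy (children), the primitive test, and the learning-rate and discount parameters are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical two-level task hierarchy: "root" invokes two primitive actions.
children = {"root": ["nav", "pickup"]}

def primitive(task):
    # A task with no children is treated as a primitive action.
    return task not in children

V = defaultdict(float)  # V[(i, s)]: projected value of executing subtask i in state s
C = defaultdict(float)  # C[(i, s, a)]: value of completing parent i after child a finishes in s

def q_value(i, s, a):
    # MAXQ decomposition: Q(i, s, a) = V(a, s) + C(i, s, a)
    return value(a, s) + C[(i, s, a)]

def value(i, s):
    # V(i, s) is the expected one-step reward for a primitive action,
    # and max_a Q(i, s, a) for a composite subtask.
    if primitive(i):
        return V[(i, s)]
    return max(q_value(i, s, a) for a in children[i])

def primitive_update(a, s, reward, alpha=0.1):
    # Primitive actions learn their expected immediate reward directly.
    V[(a, s)] = (1 - alpha) * V[(a, s)] + alpha * reward

def maxq_q_update(i, s, a, s_next, n_steps, alpha=0.1, gamma=1.0):
    # Completion-function update for composite task i after child a ran for
    # n_steps and left the environment in s_next (uncorrected form, i.e.,
    # without the pseudo-reward machinery the paper adds for some subtasks).
    a_star = max(children[i], key=lambda b: q_value(i, s_next, b))
    target = (gamma ** n_steps) * (C[(i, s_next, a_star)] + value(a_star, s_next))
    C[(i, s, a)] = (1 - alpha) * C[(i, s, a)] + alpha * target
```

Because the update for a parent bootstraps from the recursively computed values of its children rather than from a single flat Q table, each subtask can ignore state variables irrelevant to it, which is the source of the state-abstraction and learning-speed benefits the paper reports.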