This paper introduces the MAXQ value function decomposition for hierarchical reinforcement learning. The MAXQ method decomposes a Markov decision process (MDP) into a hierarchy of subtasks and decomposes the value function into an additive combination of the value functions of the subtasks. This decomposition has both procedural and declarative semantics, representing a hierarchical policy's value function. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, by Kaelbling, and by Dayan and Hinton. It assumes that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining subgoals, the programmer constrains the set of policies considered during reinforcement learning. The MAXQ decomposition can represent the value function of any policy consistent with the given hierarchy. It also enables state abstractions, allowing individual MDPs in the hierarchy to ignore large parts of the state space, which is important for practical applications.
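Concretely, the additive decomposition can be sketched as follows (in the paper's notation, where V(a, s) is the value of completing subtask a starting in state s, and C(i, s, a) is the completion value accumulated by parent task i after child a terminates):

```latex
% MAXQ decomposition of the action-value function of subtask i:
Q^{\pi}(i, s, a) = V^{\pi}(a, s) + C^{\pi}(i, s, a)

% For a composite subtask i, the projected value function is
V^{\pi}(i, s) = Q^{\pi}\bigl(i, s, \pi_i(s)\bigr),

% while for a primitive action i it is the expected immediate reward:
V^{\pi}(i, s) = \sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i).
```

Unrolling this recursion expresses the value of a state under the hierarchical policy as a sum of completion terms along the path from the root task down to a primitive action.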
The paper defines the MAXQ hierarchy, proves its representational power, and establishes five conditions for the safe use of state abstractions. It presents an online, model-free learning algorithm, MAXQ-Q, which converges with probability 1 to a recursively optimal policy even with state abstractions. The paper evaluates MAXQ and MAXQ-Q through experiments in three domains, showing that MAXQ-Q converges to a recursively optimal policy faster than flat Q learning. The ability to learn a representation of the value function, rather than only a policy, makes it possible to compute and execute improved, non-hierarchical policies via a procedure similar to policy improvement. The paper concludes with a comparison to related work and a discussion of design tradeoffs in hierarchical reinforcement learning.
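To make the value function representation concrete, the following is a minimal sketch of how a hierarchical policy's value is evaluated recursively under the MAXQ decomposition. The toy hierarchy, task names, and all table entries below are illustrative assumptions, not taken from the paper's experimental domains:

```python
# Toy two-level hierarchy: composite task "Root" chooses between
# primitive actions "a1" and "a2". All names and values are made up
# for illustration.

# V(a, s) for primitive actions: expected one-step reward.
V_primitive = {("a1", "s0"): 1.0, ("a2", "s0"): 0.5}

# C(i, s, a): completion value of parent task i after child a finishes.
C = {("Root", "s0", "a1"): 2.0, ("Root", "s0", "a2"): 3.0}

# The hierarchical policy: which child each composite task picks per state.
policy = {("Root", "s0"): "a1"}

PRIMITIVES = {"a1", "a2"}

def V(task, state):
    """Recursively evaluate the projected value function V(task, state)."""
    if task in PRIMITIVES:
        return V_primitive[(task, state)]
    a = policy[(task, state)]
    # MAXQ decomposition: Q(i, s, a) = V(a, s) + C(i, s, a)
    return V(a, state) + C[(task, state, a)]

print(V("Root", "s0"))  # prints 3.0 (= 1.0 from a1 plus completion 2.0)
```

Because the value of the root decomposes into per-subtask terms, each term can be learned with its own (possibly abstracted) state representation, which is what the paper's MAXQ-Q algorithm exploits.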