FeUdal Networks for Hierarchical Reinforcement Learning

6 Mar 2017 | Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, Koray Kavukcuoglu
FeUdal Networks (FuN) are a novel architecture for hierarchical reinforcement learning, inspired by feudal reinforcement learning. The approach decouples end-to-end learning across multiple levels, allowing different levels of the hierarchy to operate at different time resolutions. The architecture consists of a Manager module and a Worker module: the Manager operates at a lower temporal resolution and sets abstract goals, which are conveyed to and enacted by the Worker, while the Worker generates primitive actions at every environment tick. This decoupled structure facilitates long-term credit assignment and encourages the emergence of sub-policies associated with different goals.

Concretely, FuN is a modular neural network with two modules, the Worker and the Manager, which share a perceptual module that takes an observation and computes a shared intermediate representation. From this representation the Manager computes a latent state representation and outputs a goal vector; the Worker produces actions conditioned on the external observation, its own state, and the Manager's goal. A minimal sketch of this two-level forward pass follows.
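Below is a hedged sketch of that forward pass in PyTorch. It is an illustration, not the paper's implementation: the layer sizes, the goal-pooling horizon c = 10, and names such as FeudalNet, percept, and phi are assumptions. The Manager's recurrent core appears here as a plain LSTM cell; the paper's dilated LSTM is sketched later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeudalNet(nn.Module):
    def __init__(self, obs_dim, d=256, k=16, num_actions=18, c=10):
        super().__init__()
        self.percept = nn.Linear(obs_dim, d)        # shared perceptual module
        self.m_space = nn.Linear(d, d)              # Manager's latent state map
        self.m_rnn = nn.LSTMCell(d, d)              # stand-in for the dilated LSTM
        self.w_rnn = nn.LSTMCell(d, k * num_actions)
        self.phi = nn.Linear(d, k, bias=False)      # projects pooled goals to w_t
        self.num_actions, self.k, self.c = num_actions, k, c

    def forward(self, obs, m_state, w_state, goal_hist):
        z = F.relu(self.percept(obs))               # shared representation z_t
        s = F.relu(self.m_space(z))                 # Manager's latent state s_t
        h_m, c_m = self.m_rnn(s, m_state)
        g = F.normalize(h_m, dim=-1)                # directional goal g_t (unit norm)
        goal_hist.append(g)
        pooled = torch.stack(goal_hist[-self.c:]).sum(0)  # pool last c goals
        w = self.phi(pooled.detach())               # block Worker gradients into goals
        h_w, c_w = self.w_rnn(z, w_state)
        U = h_w.view(-1, self.num_actions, self.k)  # per-action embeddings U_t
        logits = torch.bmm(U, w.unsqueeze(-1)).squeeze(-1)
        return F.softmax(logits, dim=-1), s, g, (h_m, c_m), (h_w, c_w)
```

The detach on the pooled goals reflects the paper's design choice that no gradient from the Worker's loss propagates into the Manager's goals; the projection phi itself is still trained by the Worker's gradients.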
Training is decoupled across the two levels. The Manager's goals are trained with an approximate transition policy gradient: rather than receiving gradients through the Worker's actions, the Manager learns to select latent goal directions that maximize extrinsic reward. The Worker, in turn, is trained via an intrinsic reward to produce actions that achieve these goals. Because goals are directional rather than absolute, the Worker's intrinsic reward measures the cosine similarity between the actual change in the latent state and the goal direction, averaged over a horizon of c steps, as sketched below.
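A sketch of that intrinsic reward, following the paper's definition r^I_t = (1/c) * sum_{i=1..c} d_cos(s_t - s_{t-i}, g_{t-i}), where d_cos is cosine similarity; the horizon c = 10 and the list-of-tensors interface are assumptions.

```python
import torch
import torch.nn.functional as F

def intrinsic_reward(states, goals, t, c=10):
    """states, goals: lists of (batch, d) tensors for steps 0..t."""
    terms = []
    for i in range(1, c + 1):
        if t - i < 0:
            break
        delta = states[t] - states[t - i]          # direction actually travelled
        terms.append(F.cosine_similarity(delta, goals[t - i], dim=-1))
    if not terms:
        return torch.zeros(states[t].shape[0])     # no history yet at t = 0
    return torch.stack(terms).mean(0)              # r^I_t, one value per batch element
```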
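The Manager's update can be sketched in the same style. The approximate transition policy gradient ascends A^M_t * d_cos(s_{t+c} - s_t, g_t), where A^M_t is the Manager's advantage estimate; the version below assumes precomputed latent states, goals, and advantages, and stops gradients into the states for simplicity.

```python
import torch
import torch.nn.functional as F

def manager_loss(states, goals, advantages, c=10):
    """states: (T, B, d) latents; goals: (T, B, d) with autograd graph attached;
    advantages: (T, B) estimates of A^M_t. Returns a scalar loss to minimize."""
    T = goals.shape[0]
    losses = []
    for t in range(T - c):
        delta = (states[t + c] - states[t]).detach()   # s_{t+c} - s_t, held constant
        cos = F.cosine_similarity(delta, goals[t], dim=-1)
        losses.append(-advantages[t].detach() * cos)   # ascend A^M_t * d_cos
    return torch.stack(losses).mean()
```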
The key contributions of FuN are: a consistent, end-to-end differentiable model that generalizes the principles of feudal reinforcement learning; the novel approximate transition policy gradient update for training the Manager; the use of directional rather than absolute goals; and a novel RNN design for the Manager, a dilated LSTM, which extends the longevity of the recurrent state memories and allows gradients to flow through large time hops, enabling effective back-propagation through hundreds of steps.
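A dilated LSTM with dilation radius r maintains r separate groups of recurrent sub-states and, at time t, advances only the group t mod r, so each sub-state is updated once every r ticks and its memories persist correspondingly longer. A minimal sketch under that reading (the paper additionally pools recent outputs, omitted here for brevity):

```python
import torch
import torch.nn as nn

class DilatedLSTM(nn.Module):
    """Keeps r recurrent sub-states; advances only sub-state t % r each tick."""
    def __init__(self, input_size, hidden_size, radius=10):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)  # weights shared across sub-states
        self.radius = radius
        self.hidden_size = hidden_size

    def init_state(self, batch, device=None):
        zeros = torch.zeros(self.radius, batch, self.hidden_size, device=device)
        return zeros, zeros.clone()

    def forward(self, x, state, t):
        h, c = state
        idx = t % self.radius                      # round-robin choice of sub-state
        h_new, c_new = self.cell(x, (h[idx], c[idx]))
        h, c = h.clone(), c.clone()                # avoid in-place writes on autograd history
        h[idx], c[idx] = h_new, c_new
        return h_new, (h, c)
```

Because each sub-state is touched only once every r ticks, back-propagation through time crosses r environment steps per recurrent transition, which is what lets gradients reach hundreds of steps into the past.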
Empirically, FuN is evaluated on a range of tasks from the ATARI suite and the 3D DeepMind Lab environment. It significantly improves long-term credit assignment and memorization, including on Montezuma's Revenge and on several DeepMind Lab memory tasks, and significantly outperforms a strong LSTM baseline on tasks requiring long-term credit assignment or visual memory. Compared with other methods, including the Option-Critic architecture, FuN shows superior performance on many tasks. The experiments demonstrate that FuN learns non-trivial, helpful, and interpretable sub-policies and sub-goals, and an ablation study confirms that the transition policy gradient and directional goals are crucial for best performance. The architecture also shows potential for transfer and multitask learning, with learned behavioural primitives that can be reused to acquire new complex skills.