Emergence of Locomotion Behaviours in Rich Environments

10 Jul 2017 | Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, S. M. Ali Eslami, Martin Riedmiller, David Silver
This paper explores how a rich environment can promote the learning of complex behaviors in reinforcement learning. The authors train agents across diverse environmental contexts and find that this diversity encourages the emergence of robust behaviors that perform well across a suite of tasks. They demonstrate the principle for locomotion, a domain known for its sensitivity to the choice of reward. Agents are trained on a diverse set of challenging terrains and obstacles using a simple reward function based on forward progress, and a novel scalable variant of policy gradient reinforcement learning allows them to learn to run, jump, crouch, and turn without explicit reward-based guidance for those skills.

The paper begins by discussing the challenges of reinforcement learning in continuous control tasks such as locomotion, where the appropriate reward function is often non-obvious. The authors argue that rich and robust behaviors can emerge from simple reward functions if the environment itself contains sufficient richness and diversity. They propose that presenting agents with a diversity of challenges increases the performance gap between different solutions and may favor the learning of solutions that are robust across settings.

The experiments focus on a set of novel locomotion tasks that go significantly beyond the previous state of the art for agents trained directly with reinforcement learning. The tasks comprise a variety of obstacle courses for agents with different bodies (Quadruped, Planar Walker, and Humanoid). The courses are procedurally generated so that every episode presents a different instance of the task, and they contain a wide range of obstacles with varying levels of difficulty. These variations present an implicit curriculum: as the agent increases its capabilities, it is able to overcome increasingly hard challenges, resulting in the emergence of ostensibly sophisticated locomotion skills that might naïvely have seemed to require careful reward design or other instruction. The authors also show that learning speed can be improved by explicitly structuring terrains to increase gradually in difficulty, so that the agent faces easier obstacles first and harder ones only once it has mastered the easy ones (an illustrative sketch of such a setup appears at the end of this summary).

To learn effectively in these rich and challenging domains, the authors rely on a reliable and scalable reinforcement learning algorithm that combines components from several recent approaches to deep reinforcement learning. They build on robust policy gradient algorithms, such as trust region policy optimization (TRPO) and proximal policy optimization (PPO), which bound parameter updates to a trust region to ensure stability, and they distribute the computation over many parallel instances of agent and environment. Their distributed implementation of PPO improves over TRPO in terms of wall-clock time with little difference in robustness, and also improves over their existing implementation of A3C with continuous actions when the same number of workers is used. Rough sketches of the surrogate objective and of the distributed update pattern follow below.
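To make the PPO-style surrogate objective concrete, the snippet below shows the clipped form of the loss, one common way of keeping the updated policy close to the policy that collected the data. This is a minimal sketch, not the paper's implementation: the DPPO variant described in the paper penalizes the KL divergence with an adaptive coefficient rather than relying only on clipping, and the clipping constant and toy batch here are placeholders.

```python
import numpy as np

def ppo_clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Illustrative PPO-style clipped surrogate loss (to be minimized).

    logp_new:   log pi_theta(a_t | s_t) under the current policy
    logp_old:   log pi_theta_old(a_t | s_t) under the data-collecting policy
    advantages: advantage estimates A_t (e.g. from GAE)
    clip_eps:   clipping parameter; 0.2 is a common default, not a value from the paper
    """
    ratio = np.exp(logp_new - logp_old)                      # importance ratio r_t
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (lower) bound on the surrogate objective, negated to give a loss.
    return -np.mean(np.minimum(unclipped, clipped))

# Toy usage with random numbers, just to show the shapes involved.
rng = np.random.default_rng(0)
logp_old = rng.normal(size=64)
logp_new = logp_old + 0.05 * rng.normal(size=64)
adv = rng.normal(size=64)
print(ppo_clipped_surrogate(logp_new, logp_old, adv))
```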
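The distributed part of DPPO amounts to running many workers that gather data and compute gradients in parallel, then combining their updates centrally. The toy sketch below mimics only that synchronous gradient-averaging pattern: the worker_gradient function uses a quadratic stand-in loss, and the learning rate and batch structure are invented for illustration; it does not reproduce the paper's parameter-server implementation or its PPO gradients.

```python
import numpy as np

def worker_gradient(theta, data):
    """Toy stand-in for a worker's gradient: gradient of the quadratic
    loss 0.5 * ||theta - data||^2, i.e. (theta - data). In a real system each
    worker would instead compute a PPO gradient on its own trajectories."""
    return theta - data

def distributed_update(theta, worker_batches, lr=0.1):
    """Average per-worker gradients and take one step on the shared parameters,
    mimicking the synchronous averaging used in distributed PPO-style setups."""
    grads = [worker_gradient(theta, batch) for batch in worker_batches]
    avg_grad = np.mean(grads, axis=0)
    return theta - lr * avg_grad

# Toy run: 4 "workers", each with its own batch; the shared parameters
# converge to the mean of the worker batches.
rng = np.random.default_rng(1)
theta = np.zeros(3)
batches = [rng.normal(loc=1.0, size=3) for _ in range(4)]
for _ in range(100):
    theta = distributed_update(theta, batches)
print(theta, np.mean(batches, axis=0))
```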
The paper proceeds as follows. In Section 2, the authors describe the distributed PPO (DPPO) algorithm that enables the subsequent experiments and validate its effectiveness empirically. In Section 3, they introduce the main experimental setup: a diverse set of challenging terrains and obstacles. In Section 4, they provide evidence that effective locomotion behaviours emerge directly from simple rewards; furthermore, they show that terrains structured to increase gradually in difficulty can accelerate learning.
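To ground the experimental setup of Sections 3 and 4, the snippet below illustrates, under loose assumptions, the two ingredients emphasised in this summary: procedurally generated terrain whose obstacle difficulty ramps up along the course (the implicit curriculum), and a reward dominated by forward progress. The functions generate_terrain and progress_reward, and every constant in them, are hypothetical placeholders rather than the paper's actual environment code.

```python
import numpy as np

def generate_terrain(num_obstacles=20, max_height=1.0, seed=None):
    """Sample obstacle heights that, on average, increase along the track,
    giving an implicit curriculum: early obstacles are easy, later ones hard.
    A fresh call produces a new course, so every episode differs."""
    rng = np.random.default_rng(seed)
    difficulty = np.linspace(0.1, 1.0, num_obstacles)   # ramps up along the course
    heights = rng.uniform(0.0, max_height, size=num_obstacles) * difficulty
    return heights

def progress_reward(x_before, x_after, alive_bonus=0.1, ctrl_cost=0.0):
    """Simple forward-progress reward: distance gained along the track, plus a
    small alive bonus and an optional control penalty (all placeholder terms)."""
    return (x_after - x_before) + alive_bonus - ctrl_cost

# Toy usage: a new course for an "episode" and a reward for moving forward.
print(generate_terrain(num_obstacles=5, seed=0))
print(progress_reward(x_before=1.2, x_after=1.5))
```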