HIGH-DIMENSIONAL CONTINUOUS CONTROL USING GENERALIZED ADVANTAGE ESTIMATION

20 Oct 2018 | John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan and Pieter Abbeel
This paper introduces a method for high-dimensional continuous control using generalized advantage estimation (GAE). The authors propose a policy gradient method that reduces the variance of policy gradient estimates by using an exponentially-weighted estimator of the advantage function, analogous to TD(λ). They also use trust region optimization for both the policy and the value function, each represented by a neural network.

The method is applied to challenging 3D locomotion tasks: learning running gaits for simulated bipedal and quadrupedal robots, and learning a policy for a biped to stand up from lying on the ground. The neural network policies map directly from raw kinematics to joint torques, avoiding the need for hand-crafted policy representations. The algorithm is fully model-free, and the amount of simulated experience required for the 3D biped tasks corresponds to 1-2 weeks of real time. The paper also gives an interpretation of GAE as a form of reward shaping, in which the approximate value function is used as the shaping potential.

The authors show that GAE achieves strong empirical results on these tasks, extending the state of the art in high-dimensional continuous control. The approach is effective at learning neural network policies for challenging control tasks and is robust to the nonstationarity of the incoming data. The paper further discusses the relationship between value function estimation error and policy gradient estimation error, and suggests that future work could explore adaptive adjustment of the estimator parameters. Overall, the results demonstrate that GAE is a promising approach for high-dimensional continuous control.
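The core of the estimator is an exponentially-weighted sum of one-step TD residuals, δ_t = r_t + γV(s_{t+1}) − V(s_t), with A_t = Σ_l (γλ)^l δ_{t+l}. The following is a minimal sketch of that computation in Python/NumPy, assuming per-step rewards and value-function predictions for a single trajectory; the function name and array layout are illustrative and not taken from the paper's code.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory (illustrative sketch).

    rewards: shape (T,)   -- rewards r_0 ... r_{T-1}
    values:  shape (T+1,) -- value estimates V(s_0) ... V(s_T); the last entry
                             bootstraps the tail (use 0.0 if the episode terminated)

    Uses the backward recursion A_t = delta_t + gamma * lam * A_{t+1},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        gae = deltas[t] + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example with made-up numbers: a 3-step trajectory ending in a terminal state.
rewards = np.array([1.0, 0.5, 2.0])
values = np.array([0.8, 0.9, 1.1, 0.0])
print(compute_gae(rewards, values, gamma=0.99, lam=0.95))
```

Setting λ = 1 recovers the Monte Carlo advantage estimate (low bias, high variance), while λ = 0 reduces to the one-step TD residual (higher bias, lower variance); the paper's experiments tune λ between these extremes.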