CONTINUOUS CONTROL WITH DEEP REINFORCEMENT LEARNING

5 Jul 2019 | Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver & Daan Wierstra
The paper presents a model-free, off-policy actor-critic algorithm called Deep DPG (DDPG) that can learn policies in high-dimensional, continuous action spaces. DDPG is based on the deterministic policy gradient (DPG) algorithm but incorporates insights from the success of Deep Q-Learning (DQN). DQN, which uses deep neural networks to estimate the action-value function, can handle high-dimensional observation spaces but is limited to discrete, low-dimensional action spaces. DDPG addresses this by using a replay buffer and target networks to stabilize learning, allowing large, non-linear function approximators to be trained online.

The authors evaluate DDPG on a variety of challenging physical control tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion, and car driving. They demonstrate that DDPG can robustly solve these tasks from low-dimensional observations and, in many cases, directly from raw pixel inputs. Its performance is competitive with that of a planning algorithm given full access to the dynamics of the domain and its derivatives.

Key contributions include the combination of actor-critic methods with deep neural network function approximators, the use of a replay buffer and target networks to stabilize learning, and the application of batch normalization to handle observations with differing units and scales. The results show that DDPG can learn effective policies in complex, high-dimensional environments, even when learning directly from raw pixels.
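To make the mechanics concrete, below is a minimal sketch of one DDPG update step, assuming PyTorch, small MLP networks, and illustrative hyperparameters (the network sizes, learning rates, and the `update` helper are not the paper's exact settings). The paper's batch normalization, tanh action bounding, and exploration noise are omitted for brevity; the sketch only shows the critic's Bellman backup against target networks, the deterministic policy gradient for the actor, and the soft target updates.

```python
# Minimal DDPG update sketch (illustrative, not the authors' code).
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 3, 1, 0.99, 0.005  # assumed toy dimensions/hyperparameters

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                         nn.Linear(64, out_dim))

actor, critic = mlp(obs_dim, act_dim), mlp(obs_dim + act_dim, 1)
# Target networks are slowly updated copies used to stabilize bootstrapping.
actor_targ = mlp(obs_dim, act_dim); actor_targ.load_state_dict(actor.state_dict())
critic_targ = mlp(obs_dim + act_dim, 1); critic_targ.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(batch):
    s, a, r, s2, done = batch  # tensors sampled uniformly from the replay buffer
    # Critic: regress Q(s, a) toward the bootstrapped target computed with the
    # *target* actor and critic (Bellman backup).
    with torch.no_grad():
        q_targ = r + gamma * (1 - done) * critic_targ(
            torch.cat([s2, actor_targ(s2)], dim=-1)).squeeze(-1)
    q = critic(torch.cat([s, a], dim=-1)).squeeze(-1)
    critic_loss = ((q - q_targ) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: deterministic policy gradient -- ascend Q(s, mu(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Soft target update: theta' <- tau * theta + (1 - tau) * theta'.
    with torch.no_grad():
        for net, targ in ((actor, actor_targ), (critic, critic_targ)):
            for p, p_targ in zip(net.parameters(), targ.parameters()):
                p_targ.mul_(1 - tau).add_(tau * p)
```

The soft update with a small tau is what distinguishes DDPG's target networks from DQN's periodic hard copies: the targets change slowly, which trades some learning speed for much greater stability when bootstrapping with large non-linear approximators.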