28 Aug 2017 | John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov
The paper introduces Proximal Policy Optimization (PPO), a new family of policy gradient methods for reinforcement learning. PPO alternates between sampling data from the current policy and optimizing a "surrogate" objective function with stochastic gradient ascent. Unlike standard policy gradient methods, which perform one gradient update per data sample, PPO allows multiple epochs of minibatch updates on the same batch of data. The key innovation is a clipped probability ratio objective, which forms a pessimistic (lower-bound) estimate of the policy's performance; by penalizing large policy updates it keeps training stable. PPO is simpler to implement than trust-region methods, more general, and empirically outperforms other online policy gradient methods, particularly in terms of sample complexity. Experiments on benchmark tasks, including simulated robotic locomotion and Atari game playing, demonstrate PPO's favorable performance compared to algorithms such as TRPO, A2C, and ACER.
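The clipped objective described above is L^CLIP(theta) = E_t[ min(r_t(theta) * A_t, clip(r_t(theta), 1 - eps, 1 + eps) * A_t) ], where r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) is the probability ratio and A_t an advantage estimate. Below is a minimal NumPy sketch of that loss; the function name, batch shapes, and the eps = 0.2 default are illustrative choices for this note, not taken from the summary above.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (to be maximized), illustrative sketch.

    logp_new:   log-probabilities of the taken actions under the current policy
    logp_old:   log-probabilities under the policy that collected the data
    advantages: advantage estimates A_t for each timestep
    clip_eps:   clipping range epsilon (0.2 is a commonly used value)
    """
    # r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s), computed from log-probs
    ratio = np.exp(logp_new - logp_old)
    clipped_ratio = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic estimate: take the minimum of the unclipped and clipped
    # terms, then average over the batch.
    return np.mean(np.minimum(ratio * advantages, clipped_ratio * advantages))

# Toy usage with random data (hypothetical batch of 64 timesteps)
rng = np.random.default_rng(0)
logp_old = rng.normal(size=64)
logp_new = logp_old + rng.normal(scale=0.1, size=64)
adv = rng.normal(size=64)
print(ppo_clip_objective(logp_new, logp_old, adv))
```

Because the minimum is taken between the clipped and unclipped terms, the objective only ignores changes in the ratio when they would make the objective improve too much, which is why multiple epochs of minibatch updates on the same data remain stable.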