Trust Region Policy Optimization

20 Apr 2017 | John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, Pieter Abbeel
The paper introduces Trust Region Policy Optimization (TRPO), a practical algorithm for optimizing policies in reinforcement learning. TRPO is derived from a theoretically justified procedure that guarantees monotonic improvement in policy performance. The key idea is to maximize a surrogate objective while constraining the KL divergence between the new and old policies, which yields robust and efficient updates. The authors demonstrate the effectiveness of TRPO through experiments on simulated robotic locomotion tasks (swimming, hopping, and walking) and Atari games, showing that it can learn complex policies with strong performance and minimal hyperparameter tuning. The paper also discusses the theoretical guarantees and connections to other policy optimization methods, providing a unified perspective on various policy update schemes.
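As a rough sketch of the update the summary refers to, the per-iteration problem can be written as a constrained optimization over the new policy parameters θ. The symbols below follow standard TRPO notation rather than being quoted from this page: A is the advantage function under the old policy, ρ its state-visitation distribution, and δ a step-size hyperparameter.

\[
\max_{\theta}\;
\mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim \pi_{\theta_{\mathrm{old}}}}
\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A_{\theta_{\mathrm{old}}}(s, a) \right]
\quad \text{subject to} \quad
\mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}
\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\,\big\|\,\pi_{\theta}(\cdot \mid s) \right) \right] \le \delta
\]

Maximizing the surrogate objective alone could take arbitrarily large steps away from the data-collecting policy; the KL constraint keeps each update inside a trust region around the old policy, which is what underlies the robustness and the monotonic-improvement guarantee described above.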