2018 | Scott Fujimoto, Herke van Hoof, David Meger
This paper addresses overestimation bias in actor-critic methods for continuous control. The authors show that overestimation bias, a well-known problem in discrete-action Q-learning, persists in actor-critic methods and propose mechanisms to mitigate it. Building on Double Q-learning, they train a pair of critics independently and form the learning target from the minimum of the two value estimates, a clipped variant of Double Q-learning that limits overestimation. They also delay policy updates relative to value updates to reduce per-update error, and they discuss the connection between target networks and estimation error. In addition, they introduce a regularization strategy, target policy smoothing, that reduces variance by bootstrapping off of similar action estimates. The resulting algorithm, Twin Delayed Deep Deterministic Policy Gradient (TD3), is evaluated on seven continuous control domains from the OpenAI Gym suite and significantly outperforms the state of the art. The authors conclude that mitigating overestimation can greatly improve the performance of modern algorithms.
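As a rough illustration of how the three ingredients (clipped double Q-learning, delayed policy updates, and target policy smoothing) fit together, below is a minimal sketch of a single TD3 update step in PyTorch. The `Actor`/`Critic` modules, the `td3_update` function, its optimizer arguments, and the hyperparameter defaults are illustrative assumptions for this sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Critic(nn.Module):
    """Q(s, a) approximator; TD3 trains two of these with separate targets."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


class Actor(nn.Module):
    """Deterministic policy mu(s) with actions bounded by max_action."""
    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)


def td3_update(actor, actor_target, critic1, critic2, critic1_target, critic2_target,
               actor_opt, critic_opt, batch, step,
               gamma=0.99, tau=0.005, policy_noise=0.2, noise_clip=0.5, policy_delay=2):
    # `batch` is assumed to hold replay-buffer tensors of shape (batch_size, ...),
    # with reward/done shaped (batch_size, 1); `critic_opt` optimizes both critics.
    state, action, reward, next_state, done = batch
    max_action = actor.max_action

    with torch.no_grad():
        # Target policy smoothing: bootstrap from a small neighbourhood of the target action.
        noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (actor_target(next_state) + noise).clamp(-max_action, max_action)

        # Clipped Double Q-learning: take the minimum of the two target critics.
        target_q = torch.min(critic1_target(next_state, next_action),
                             critic2_target(next_state, next_action))
        target_q = reward + (1.0 - done) * gamma * target_q

    # Both critics regress toward the same clipped target.
    critic_loss = F.mse_loss(critic1(state, action), target_q) + \
                  F.mse_loss(critic2(state, action), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed policy updates: refresh the actor and targets only every `policy_delay` steps.
    if step % policy_delay == 0:
        actor_loss = -critic1(state, actor(state)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # Slow (Polyak-averaged) target-network updates.
        for net, target in ((actor, actor_target), (critic1, critic1_target),
                            (critic2, critic2_target)):
            for p, tp in zip(net.parameters(), target.parameters()):
                tp.data.mul_(1.0 - tau).add_(tau * p.data)
```

Taking the minimum of the two target critics trades a small underestimation bias for protection against the compounding overestimation the paper analyzes, while updating the actor and targets only every `policy_delay` critic steps lets the value estimate settle before it is used to change the policy.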