Addressing Function Approximation Error in Actor-Critic Methods

2018 | Scott Fujimoto, Herke van Hoof, David Meger
The paper addresses function approximation errors in actor-critic methods, which can lead to overestimated value estimates and suboptimal policies. It proposes a mechanism to minimize these effects by adapting the Double Q-learning approach to an actor-critic setting. The authors introduce a clipped Double Q-learning variant that uses a pair of independently trained critics to limit overestimation. They also propose delaying policy updates to reduce per-update error and improve performance. The method is evaluated on a range of OpenAI Gym continuous control tasks, where it outperforms state-of-the-art methods. The paper includes a detailed analysis of overestimation bias in actor-critic methods, of the effectiveness of target networks in reducing error accumulation, and of a regularization strategy inspired by SARSA (target policy smoothing) that reduces variance. The proposed algorithm, Twin Delayed Deep Deterministic Policy Gradient (TD3), is shown to significantly improve both learning speed and final performance in continuous control tasks.
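
To make the three ingredients above concrete, here is a minimal sketch of a TD3-style update step. This is not the authors' reference implementation; the network sizes, noise scales, and hyperparameter values are illustrative assumptions, chosen only to show how clipped Double Q-learning, target policy smoothing, and delayed policy updates fit together.

```python
# Sketch of a TD3-style training step (illustrative, not the authors' code).
# Assumed hyperparameters: gamma, tau, policy_noise, noise_clip, policy_delay.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))
    def forward(self, x):
        return self.net(x)

state_dim, action_dim, max_action = 17, 6, 1.0  # example dimensions
actor = MLP(state_dim, action_dim)
critic1 = MLP(state_dim + action_dim, 1)
critic2 = MLP(state_dim + action_dim, 1)
actor_t, critic1_t, critic2_t = map(copy.deepcopy, (actor, critic1, critic2))

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(list(critic1.parameters()) + list(critic2.parameters()), lr=3e-4)

gamma, tau, policy_noise, noise_clip, policy_delay = 0.99, 0.005, 0.2, 0.5, 2

def train_step(step, state, action, reward, next_state, not_done):
    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target action.
        noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (torch.tanh(actor_t(next_state)) * max_action + noise).clamp(-max_action, max_action)

        # Clipped Double Q-learning: take the minimum of the two target critics.
        next_sa = torch.cat([next_state, next_action], dim=1)
        target_q = torch.min(critic1_t(next_sa), critic2_t(next_sa))
        target_q = reward + not_done * gamma * target_q

    # Both critics regress toward the shared clipped target.
    sa = torch.cat([state, action], dim=1)
    critic_loss = F.mse_loss(critic1(sa), target_q) + F.mse_loss(critic2(sa), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Delayed policy updates: the actor and the target networks are updated
    # only every `policy_delay` critic updates.
    if step % policy_delay == 0:
        a = torch.tanh(actor(state)) * max_action
        actor_loss = -critic1(torch.cat([state, a], dim=1)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # Soft (Polyak) update of all target networks.
        for net, net_t in ((actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)
```

Taking the minimum over the two critics trades a small underestimation bias for protection against the compounding overestimation that the paper identifies, while the delayed, smoothed policy updates keep the actor from exploiting transient errors in the value estimate.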