25 Nov 2019 | Aviral Kumar, Justin Fu, George Tucker, Sergey Levine
The paper "Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction" by Aviral Kumar addresses the issue of instability in off-policy reinforcement learning (RL) methods, which are sensitive to the data distribution and struggle to learn effectively without additional on-policy data. The authors identify *bootstrapping error* as a key source of instability, which occurs when the Bellman backup operator is applied to actions outside the training data distribution. They theoretically analyze this error and propose a practical algorithm called *Bootstrapping Error Accumulation Reduction* (BEAR) to mitigate it. BEAR constrains action selection in the backup process to prevent error accumulation, ensuring that the learned policy remains within the support of the training distribution. The paper demonstrates that BEAR can robustly learn from various off-policy distributions, including random and suboptimal demonstrations, on continuous control tasks. The authors also provide a detailed analysis of the theoretical properties of BEAR and compare it with existing methods, showing that BEAR outperforms state-of-the-art algorithms in terms of stability and performance.The paper "Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction" by Aviral Kumar addresses the issue of instability in off-policy reinforcement learning (RL) methods, which are sensitive to the data distribution and struggle to learn effectively without additional on-policy data. The authors identify *bootstrapping error* as a key source of instability, which occurs when the Bellman backup operator is applied to actions outside the training data distribution. They theoretically analyze this error and propose a practical algorithm called *Bootstrapping Error Accumulation Reduction* (BEAR) to mitigate it. BEAR constrains action selection in the backup process to prevent error accumulation, ensuring that the learned policy remains within the support of the training distribution. The paper demonstrates that BEAR can robustly learn from various off-policy distributions, including random and suboptimal demonstrations, on continuous control tasks. The authors also provide a detailed analysis of the theoretical properties of BEAR and compare it with existing methods, showing that BEAR outperforms state-of-the-art algorithms in terms of stability and performance.