Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

25 Nov 2019 | Aviral Kumar, Justin Fu, George Tucker, Sergey Levine
This paper introduces BEAR (Bootstrapping Error Accumulation Reduction), an algorithm for off-policy reinforcement learning (RL) that addresses the instability caused by out-of-distribution (OOD) actions in Q-learning. The key challenge in off-policy RL is that the policy is trained on data collected under a different distribution, and the resulting bootstrapping errors accumulate through the Bellman backup, which can make learning unstable or cause it to diverge.

The authors analyze this source of instability and show that bootstrapping errors arise from evaluating the Q-function on actions that are absent from the training data distribution. BEAR therefore constrains the learned policy to actions that are supported by the training data, reducing the risk of querying the Q-function on OOD actions and thereby mitigating bootstrapping error. Formally, BEAR is built on a distribution-constrained backup operator that restricts the set of policies over which the Q-function is maximized to those whose support lies within the training data distribution. The practical algorithm combines an ensemble of Q-functions with a constraint that keeps the learned policy within the support of the data distribution.

The authors demonstrate that BEAR can learn from a variety of off-policy datasets, including random and suboptimal demonstrations, on a range of continuous control benchmark tasks. BEAR outperforms existing methods such as BCQ (Batch-Constrained Q-learning) and naive off-policy RL in both performance and stability, and its performance is consistent across different dataset compositions. The paper also provides a theoretical analysis of error propagation in off-policy RL and shows that the proposed backup effectively reduces the impact of bootstrapping errors. The authors conclude that BEAR is a robust and effective method for off-policy RL, capable of learning from static datasets across a range of continuous control tasks.
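For reference, the distribution-constrained backup described above can be written as follows (the notation here is filled in from standard MDP conventions, with Π denoting the set of policies whose support lies within the training data distribution):

$$
\mathcal{T}^{\Pi} Q(s, a) \;=\; \mathbb{E}\Big[\, R(s, a) \,+\, \gamma \,\max_{\pi \in \Pi}\; \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')} \big[ Q(s', a') \big] \Big]
$$

Restricting the maximization to Π, rather than to all policies, is what prevents the backup from querying Q at out-of-distribution actions.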
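The summary above mentions two practical ingredients: an ensemble of Q-functions and a support constraint on the policy. The sketch below (not the authors' code) illustrates both in isolation; the paper implements the support constraint as a sampled maximum mean discrepancy (MMD) between policy actions and dataset actions, but the kernel, bandwidth, sample counts, and mixing weight used here are illustrative assumptions rather than the authors' exact settings.

```python
# Minimal sketch of the two ingredients: an MMD-style support measure and a
# conservative estimate from an ensemble of Q-functions. Hyperparameters are
# illustrative assumptions, not the paper's exact values.
import numpy as np

def mmd_squared(x, y, sigma=10.0):
    """Sample-based squared MMD with a Laplacian kernel.

    x: actions sampled from the learned policy at a state, shape (n, act_dim)
    y: actions from the dataset (behavior) policy at that state, shape (m, act_dim)
    A small value indicates the policy stays within the support of the data.
    """
    def k(a, b):
        d = np.abs(a[:, None, :] - b[None, :, :]).sum(-1)  # pairwise L1 distances
        return np.exp(-d / sigma)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

def conservative_q(q_ensemble, lam=0.75):
    """Combine an ensemble of Q-estimates for the same candidate actions.

    q_ensemble: shape (K, n_actions). Mixing the minimum and maximum penalizes
    disagreement across the ensemble, which tends to be large on OOD actions.
    """
    return lam * q_ensemble.min(axis=0) + (1.0 - lam) * q_ensemble.max(axis=0)

# Toy usage: as policy actions drift away from the data, the MMD grows,
# signaling a violation of the support constraint.
rng = np.random.default_rng(0)
data_actions = rng.normal(0.0, 0.2, size=(16, 2))
policy_near = rng.normal(0.0, 0.2, size=(8, 2))
policy_far = rng.normal(2.0, 0.2, size=(8, 2))
print(mmd_squared(policy_near, data_actions))  # small: in-support
print(mmd_squared(policy_far, data_actions))   # large: out of support

q_values = rng.normal(size=(4, 8))   # K=4 Q-functions, 8 candidate actions
print(conservative_q(q_values))      # conservative values used for the backup
```

In the full algorithm, the MMD term enters the policy update as a constraint (handled with a Lagrange multiplier), while the conservative ensemble estimate supplies the target values for the Q-function backup.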