Off-Policy Deep Reinforcement Learning without Exploration


10 Aug 2019 | Scott Fujimoto, David Meger, Doina Precup
This paper introduces Batch-Constrained deep Q-learning (BCQ), an off-policy deep reinforcement learning algorithm that learns effectively from arbitrary, fixed batch data without any exploration. The authors highlight a critical issue in off-policy reinforcement learning: extrapolation error, which arises when the state-action distribution of the batch differs from the distribution induced by the current policy. Unseen state-action pairs then receive inaccurate value estimates, leading to poor performance in off-policy learning.

BCQ addresses this by constraining the policy to select actions similar to those in the batch, minimizing the mismatch between the policy's state-action visitation and the data. A state-conditioned generative model proposes candidate actions resembling those previously seen, and a Q-network selects the highest-valued candidate. Keeping the policy close to on-policy behavior with respect to the available data yields more accurate value estimates.

In experiments on MuJoCo continuous-control tasks, including settings with imperfect demonstrations, BCQ outperforms existing off-policy algorithms such as DDPG and DQN, achieving stable value learning and strong performance in the batch setting. It is particularly effective when the data is noisy or incomplete, since it avoids extrapolation error by focusing on actions present in the batch, and it reaches good performance in fewer training steps than traditional deep reinforcement learning methods.
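To make the selection mechanism concrete, below is a minimal sketch of BCQ-style action selection for continuous control. The names, network sizes, and hyperparameters (vae_decoder, perturb, q_net, N_CANDIDATES, PHI) are illustrative stand-ins rather than the paper's exact architecture; the paper uses a conditional variational autoencoder as the generative model, a bounded perturbation network, and twin Q-networks with a clipped double-Q target.

```python
# Sketch of BCQ-style action selection (assumed stand-in networks, not the paper's exact models).
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, N_CANDIDATES = 17, 6, 10  # example sizes
MAX_ACTION, PHI = 1.0, 0.05                      # action bound and perturbation range

# Stand-ins for the paper's conditional VAE decoder, perturbation model, and Q-network.
vae_decoder = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                            nn.Linear(64, ACTION_DIM), nn.Tanh())
perturb = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                        nn.Linear(64, ACTION_DIM), nn.Tanh())
q_net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                      nn.Linear(64, 1))

def select_action(state: torch.Tensor) -> torch.Tensor:
    """Pick the highest-valued action among candidates that resemble the batch data."""
    with torch.no_grad():
        # Repeat the state so N candidate actions can be scored at once.
        states = state.unsqueeze(0).repeat(N_CANDIDATES, 1)
        # 1) Sample candidate actions from the state-conditioned generative model
        #    (here a stand-in decoder fed with clipped Gaussian latent noise).
        latents = torch.randn(N_CANDIDATES, ACTION_DIM).clamp(-0.5, 0.5)
        candidates = MAX_ACTION * vae_decoder(torch.cat([states, latents], dim=1))
        # 2) Apply a small learned perturbation, bounded by PHI, so the policy can
        #    improve on the data without straying far from it.
        candidates = (candidates +
                      PHI * MAX_ACTION * perturb(torch.cat([states, candidates], dim=1))
                      ).clamp(-MAX_ACTION, MAX_ACTION)
        # 3) Score candidates with the Q-network and keep the best one.
        q_values = q_net(torch.cat([states, candidates], dim=1)).squeeze(-1)
        return candidates[q_values.argmax()]

action = select_action(torch.zeros(STATE_DIM))
print(action.shape)  # torch.Size([6])
```

Because every candidate comes from the generative model, the Q-network is only ever queried on actions close to the batch distribution, which is what keeps extrapolation error in check.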
The paper concludes that BCQ provides a robust solution for off-policy learning in batch settings, offering a foundation for future research in reinforcement learning. It emphasizes the importance of addressing extrapolation error in practical applications where data collection is limited or costly.