2018 | Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, Gabriel Dulac-Arnold, John Agapiou, Joel Z. Leibo, Audrunas Gruslys
This paper introduces Deep Q-learning from Demonstrations (DQfD), an algorithm that leverages small sets of demonstration data to accelerate learning in deep reinforcement learning (RL). DQfD combines temporal difference (TD) updates with supervised classification of the demonstrator's actions, allowing the agent to pre-train on demonstration data and then continue learning from self-generated data. A prioritized replay mechanism automatically controls the ratio of demonstration to self-generated data, which the authors find critical to performance. Experimentally, DQfD outperforms Prioritized Dueling Double Deep Q-Networks (PDD DQN) on 41 of 42 games during the first million steps, and on average PDD DQN needs 83 million steps to catch up to DQfD's performance. DQfD also outperforms pure imitation learning on 39 of 42 games and learns to surpass the best demonstration in 14 of 42 games. Additionally, DQfD achieves state-of-the-art results on 11 games using human demonstrations. The paper covers the RL background, related work, and the experimental setup, including the Arcade Learning Environment (ALE) and the evaluation of DQfD against other algorithms.
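To make the combined objective concrete, below is a minimal NumPy sketch of how a TD term and a large-margin supervised term over the demonstrator's actions can be mixed for a single transition, with the supervised term applied only to demonstration data. The function names, the margin value, and the weighting coefficient are illustrative assumptions rather than the paper's exact implementation; the paper's full objective additionally includes an n-step TD term and L2 regularization, and in practice these losses are computed on autodiff tensors in minibatches rather than scalar NumPy values.

```python
import numpy as np

def margin_loss(q_values, expert_action, margin=0.8):
    """Large-margin supervised term for a demonstration transition:
    max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), where l(a_E, a) is
    `margin` for a != a_E and 0 otherwise. This pushes the value of
    the demonstrator's action above all other actions by a margin."""
    penalties = np.full_like(q_values, margin)
    penalties[expert_action] = 0.0  # no penalty on the expert's action
    return np.max(q_values + penalties) - q_values[expert_action]

def td_loss(q_sa, reward, max_q_next, gamma=0.99, done=False):
    """Squared 1-step TD error for a single transition."""
    target = reward + (0.0 if done else gamma * max_q_next)
    return (target - q_sa) ** 2

def combined_loss(q_values, q_next, action, reward, done, is_demo,
                  lambda_e=1.0, gamma=0.99):
    """TD loss on every transition, plus the supervised margin term
    only when the transition comes from the demonstration data."""
    loss = td_loss(q_values[action], reward, np.max(q_next), gamma, done)
    if is_demo:
        loss += lambda_e * margin_loss(q_values, action)
    return loss

# Toy usage: Q-values for a 4-action state; the demonstrator chose action 2.
q_s = np.array([0.1, 0.4, 0.3, 0.2])
q_s_next = np.array([0.2, 0.5, 0.1, 0.0])
print(combined_loss(q_s, q_s_next, action=2, reward=1.0,
                    done=False, is_demo=True))
```

Because the margin term is zero only when the demonstrator's action already dominates every other action by the margin, it acts as the supervised classification signal during pre-training, while the shared TD term keeps the learned values consistent with the Bellman equation once self-generated data is added.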