Deep Q-Learning from Demonstrations


2018 | Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, Gabriel Dulac-Arnold, John Agapiou, Joel Z. Leibo, Audrunas Gruslys
Deep Q-learning from Demonstrations (DQfD) is a reinforcement learning algorithm that leverages demonstration data to accelerate learning. It combines temporal difference (TD) updates with supervised classification of the demonstrator's actions. DQfD first pre-trains on the demonstration data alone, using a combination of TD and supervised losses, so that it learns a value function that satisfies the Bellman equation while imitating the demonstrator. After pre-training, the agent interacts with the environment and updates its network with a mix of demonstration and self-generated data. A prioritized replay mechanism automatically controls the ratio of demonstration to self-generated data during learning. Both pieces, the loss combination and the replay mechanism, are sketched in the code below.

DQfD outperforms Prioritized Dueling Double DQN (PDD DQN) on the first million steps in 41 of 42 games, and on average it takes PDD DQN 83 million steps to catch up to DQfD's performance. DQfD also outperforms pure imitation learning in mean score on 39 of 42 games and surpasses the best demonstration it was given in 14 of 42 games. By leveraging human demonstrations, DQfD achieves state-of-the-art results on 11 games, and it performs better than three related algorithms for incorporating demonstration data into DQN.

The algorithm was evaluated on the Arcade Learning Environment (ALE), a set of Atari games that serves as a standard benchmark for DQN. DQfD was tested on 42 Atari games, with human demonstrations ranging from 5,574 to 75,472 transitions per game. This is a much smaller dataset than is used in related work: AlphaGo learns from 30 million human transitions, and DQN learns from over 200 million frames. Despite this, DQfD achieves higher scores than PDD DQN for the first 36 million steps and matches PDD DQN's performance after that.

DQfD outperforms the worst demonstration episode it was given in 29 of 42 games and learns to play better than the best demonstration episode in 14 games; it also outperforms pure imitation learning in every game. These results indicate that DQfD is well suited to real-world applications where demonstration data is available: its ability to leverage demonstrations and its prioritized replay mechanism make it a strong choice for real-world reinforcement learning tasks.
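As a concrete illustration of the loss combination, here is a minimal PyTorch sketch of the two terms named above: a double-DQN 1-step TD loss and the large-margin supervised classification loss applied only to demonstration transitions. The paper's full objective also includes an n-step TD term and L2 regularization, which are omitted here for brevity; `q_net`, `target_net`, the batch layout, and the default margin value are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def dqfd_losses(q_net, target_net, batch, gamma=0.99, margin=0.8):
    """Sketch of two DQfD loss terms: 1-step double-DQN TD loss and the
    large-margin supervised loss on demonstration transitions.

    `batch` is assumed to hold tensors: states, actions (int64), rewards,
    next_states, dones (0/1 floats), and is_demo (1.0 for demo transitions).
    """
    q = q_net(batch["states"])                                # (B, num_actions)
    q_taken = q.gather(1, batch["actions"].unsqueeze(1)).squeeze(1)

    # Double-DQN target: pick the argmax with the online net, evaluate it with
    # the target net, then bootstrap unless the episode ended.
    with torch.no_grad():
        next_actions = q_net(batch["next_states"]).argmax(dim=1, keepdim=True)
        next_q = target_net(batch["next_states"]).gather(1, next_actions).squeeze(1)
        td_target = batch["rewards"] + gamma * (1.0 - batch["dones"]) * next_q
    td_loss = F.smooth_l1_loss(q_taken, td_target)

    # Large-margin supervised loss: max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E),
    # where l is zero at the demonstrator's action and a positive margin elsewhere.
    margins = torch.full_like(q, margin)
    margins.scatter_(1, batch["actions"].unsqueeze(1), 0.0)   # no margin at a_E
    supervised = (q + margins).max(dim=1).values - q_taken
    supervised_loss = (batch["is_demo"] * supervised).mean()  # demo transitions only

    return td_loss, supervised_loss
```

The total training loss would then be a weighted sum of these terms (plus the omitted n-step and L2 terms); the supervised term contributes nothing for self-generated data, which the `is_demo` mask already enforces here.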
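The automatic control of the demonstration-to-agent data ratio comes from prioritized replay. The sketch below is a simplified, hypothetical illustration of that idea as described above: demonstration transitions are never overwritten and receive a larger priority bonus than self-generated ones, so prioritized sampling decides the mix in each batch. A real implementation would use a sum tree, importance-sampling weights, and priority updates from fresh TD errors; the bonus and exponent defaults here are illustrative, not the paper's hyperparameters.

```python
import random

class DemoReplayBuffer:
    """Simplified replay buffer mixing demonstration and self-generated data."""

    def __init__(self, capacity, demo_bonus=1.0, agent_bonus=0.001, alpha=0.4):
        self.capacity = capacity        # limit applies to self-generated data only
        self.demo = []                  # demonstration transitions, never evicted
        self.agent = []                 # agent transitions, oldest discarded first
        self.demo_bonus = demo_bonus    # epsilon_d: large bonus for demo data
        self.agent_bonus = agent_bonus  # epsilon_a: small bonus for agent data
        self.alpha = alpha              # prioritization exponent

    def add_demo(self, transition, td_error=1.0):
        self.demo.append((transition, abs(td_error), True))

    def add_agent(self, transition, td_error=1.0):
        self.agent.append((transition, abs(td_error), False))
        if len(self.agent) > self.capacity:
            self.agent.pop(0)           # only self-generated data is discarded

    def sample(self, batch_size):
        entries = self.demo + self.agent
        # Priority = (|TD error| + bonus)^alpha; the larger demo bonus means the
        # demo/self-generated ratio of each batch is set automatically.
        weights = [(err + (self.demo_bonus if is_demo else self.agent_bonus)) ** self.alpha
                   for _, err, is_demo in entries]
        return [t for t, _, _ in random.choices(entries, weights=weights, k=batch_size)]
```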