IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures


28 Jun 2018 | Lasse Espeholt*†, Hubert Soyer*†, Remi Munos*†, Karen Simonyan†, Volodymyr Mnih†, Tom Ward†, Yotam Doron†, Vlad Firoiu†, Tim Harley†, Iain Dunning†, Shane Legg†, Koray Kavukcuoglu†
IMPALA (Importance Weighted Actor-Learner Architecture) is a scalable distributed deep reinforcement learning (DRL) framework designed for large-scale multi-task learning. It decouples acting from learning: many actors generate trajectories of experience and send them to a centralized learner, which updates the policy from batches of these trajectories. Because the actors' policies lag behind the learner's policy, IMPALA introduces V-trace, a general off-policy actor-critic correction that reweights trajectories with truncated importance-sampling ratios, improving the stability and robustness of training. This architecture scales to thousands of machines without sacrificing data efficiency, reaching a throughput of 250,000 frames per second, over 30 times faster than single-machine A3C. Evaluated on DMLab-30 (30 tasks) and Atari-57 (57 games), IMPALA outperforms previous A3C-based agents while using less data and demonstrates positive transfer between tasks when trained on them simultaneously. The source code is publicly available.
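To make the correction step concrete, below is a minimal NumPy sketch of how V-trace targets can be computed for a single trajectory. It assumes no episode boundaries within the trajectory and a fixed discount; the function and argument names are illustrative and not taken from the released IMPALA code.

```python
import numpy as np

def vtrace_targets(behavior_logp, target_logp, rewards, values, bootstrap_value,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Sketch of V-trace value targets for one trajectory of length T.

    behavior_logp: log mu(a_t | x_t) under the actor's (behaviour) policy, shape [T]
    target_logp:   log pi(a_t | x_t) under the learner's (target) policy, shape [T]
    rewards:       r_t, shape [T]
    values:        V(x_t) from the critic, shape [T]
    bootstrap_value: V(x_T) used to bootstrap after the last step
    """
    T = len(rewards)
    # Importance-sampling ratios pi/mu, truncated as in the V-trace definition.
    rhos = np.exp(np.asarray(target_logp) - np.asarray(behavior_logp))
    clipped_rhos = np.minimum(rho_bar, rhos)   # rho_t = min(rho_bar, pi/mu)
    clipped_cs = np.minimum(c_bar, rhos)       # c_t   = min(c_bar,  pi/mu)

    values = np.asarray(values)
    values_t_plus_1 = np.append(values[1:], bootstrap_value)
    # Temporal-difference terms: delta_t V = rho_t (r_t + gamma V(x_{t+1}) - V(x_t)).
    deltas = clipped_rhos * (np.asarray(rewards) + gamma * values_t_plus_1 - values)

    # Backward recursion: vs_t - V(x_t) = delta_t V + gamma c_t (vs_{t+1} - V(x_{t+1})).
    vs_minus_v = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * clipped_cs[t] * acc
        vs_minus_v[t] = acc
    vs = vs_minus_v + values

    # Policy-gradient advantages use the one-step V-trace target as the bootstrap.
    vs_t_plus_1 = np.append(vs[1:], bootstrap_value)
    pg_advantages = clipped_rhos * (np.asarray(rewards) + gamma * vs_t_plus_1 - values)
    return vs, pg_advantages
```

The learner would regress its value function toward `vs` and use `pg_advantages` in the policy-gradient term; when the actor and learner policies coincide, the ratios are 1 and the targets reduce to the usual on-policy n-step returns.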