Dota 2 with Large Scale Deep Reinforcement Learning

March 10, 2021 | Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław "Psyho" Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafał Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Pondé de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, Susan Zhang
OpenAI Five, an AI system developed by OpenAI, achieved superhuman performance in the complex real-time strategy game Dota 2 through large-scale deep reinforcement learning. Trained for ten months on a distributed training system with tools for continual training, it defeated the Dota 2 world champions, Team OG, two games to none in a best-of-three match, demonstrating that self-play reinforcement learning can achieve superhuman performance on a difficult task.

Dota 2 presents significant challenges for AI systems: long time horizons, imperfect information, and complex, continuous state-action spaces. Two teams of five players each control a hero with unique abilities on a large map containing numerous units, buildings, and other game features, so an agent must make decisions from incomplete information and model its opponents' behavior.

To address these challenges, OpenAI Five parameterized its policy as a recurrent neural network (an LSTM) with approximately 159 million parameters. The policy was trained with Proximal Policy Optimization (PPO) and Generalized Advantage Estimation (GAE), optimized with very large batch sizes over distributed computing resources.
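To make the optimization concrete, here is a minimal sketch of the two components named above: GAE for estimating advantages, and the PPO clipped surrogate objective. This is a simplified single-trajectory version; the hyperparameter defaults (`gamma`, `lam`, `eps`) are illustrative rather than the paper's settings, and terminal-state masking is omitted for brevity.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    `values` carries one extra entry: the value estimate used to
    bootstrap past the final step. Terminal masking is omitted.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Discounted sum of residuals: A_t = delta_t + gamma * lam * A_{t+1}
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """PPO clipped surrogate objective (maximized; negate for SGD)."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))
```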
Because both the game (through Valve's updates) and the model architecture changed repeatedly over the ten-month run, the team developed a technique called "surgery" to carry trained parameters over into new model versions, allowing training to continue through changes to the environment and game mechanics without a loss in performance. Training data came from self-play, with a shaped reward function built from game events such as heroes dying, resources being collected, and, ultimately, winning the game. Performance was tracked with a TrueSkill rating computed from games against a pool of fixed reference agents. Minimal sketches of these three mechanisms follow below.
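OpenAI's surgery tooling is not described at the code level, so the following is a hypothetical sketch of the core idea for a single weight matrix: when a layer grows (for example, to accept newly added observation features), the trained weights are copied into the overlapping block and the new entries are initialized near zero, so the expanded model initially behaves like the old one. The function name and dimensions are invented for illustration.

```python
import numpy as np

def surgery_expand(old_w, new_shape, init_scale=1e-3):
    """Copy trained weights into the overlapping block of a larger
    matrix; newly added rows/columns get a near-zero initialization
    so the expanded network starts out computing (approximately)
    the same function as the old one."""
    new_w = init_scale * np.random.randn(*new_shape)
    rows = min(old_w.shape[0], new_shape[0])
    cols = min(old_w.shape[1], new_shape[1])
    new_w[:rows, :cols] = old_w[:rows, :cols]
    return new_w

# Example: an input layer grows from 310 to 340 observation features.
old = np.random.randn(310, 512)
new = surgery_expand(old, (340, 512))
assert np.allclose(new[:310, :], old)
```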
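The shaped reward can be pictured as a weighted sum of per-tick game events. The event names and weights below are invented for illustration; the paper tabulates its actual reward weights.

```python
# Illustrative reward weights only; not the paper's actual values.
REWARD_WEIGHTS = {
    "win": 5.0,           # sparse outcome signal
    "hero_death": -1.0,   # dense shaping signals
    "last_hit": 0.16,
    "gold_gained": 0.006,
}

def shaped_reward(event_counts):
    """Weighted sum of the game events observed on this tick."""
    return sum(REWARD_WEIGHTS[event] * count
               for event, count in event_counts.items())

print(shaped_reward({"last_hit": 2, "hero_death": 1}))  # -0.68
```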
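For evaluation, a TrueSkill rating was maintained from games against fixed reference agents. This sketch uses the open-source `trueskill` Python package, which is an assumption; the paper does not say which implementation it used. The reference ratings stay pinned while only the candidate's rating is updated.

```python
from trueskill import Rating, rate_1vs1  # pip install trueskill

candidate = Rating()  # package defaults: mu=25, sigma=25/3
references = {"ref_weak": Rating(mu=20.0), "ref_strong": Rating(mu=30.0)}

def record_game(candidate, ref_name, candidate_won):
    """Update only the candidate's rating; reference agents stay fixed."""
    ref = references[ref_name]
    if candidate_won:
        candidate, _ = rate_1vs1(candidate, ref)  # winner listed first
    else:
        _, candidate = rate_1vs1(ref, candidate)
    return candidate

candidate = record_game(candidate, "ref_weak", candidate_won=True)
print(round(candidate.mu, 2), round(candidate.sigma, 2))
```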
OpenAI Five's success demonstrates the potential of deep reinforcement learning for complex tasks. Its ability to adapt to a changing environment while maintaining performance highlights the importance of scalable and flexible training methods, and the work points to further investigation of settings with ever-changing environments and iterative development.