March 10, 2021 | Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław "Psyho" Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafał Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Pondé de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, Susan Zhang
OpenAI Five, a reinforcement learning agent, defeated the world champions in Dota 2, demonstrating superhuman performance in a complex, real-time strategy game. The game presents challenges such as long time horizons, imperfect information, and high-dimensional state-action spaces. OpenAI Five leveraged existing reinforcement learning techniques, scaled up to learn from batches of approximately 2 million frames every 2 seconds. The team developed a distributed training system and tools for continual training, which allowed OpenAI Five to train for 10 months. The key ingredients for achieving superhuman performance were scaling up compute, training with Proximal Policy Optimization (PPO), and the distributed training infrastructure. The paper also discusses the challenges of training in a continuously changing environment and the use of "surgery" techniques to preserve performance across model and environment changes. The results highlight the potential of reinforcement learning for solving complex, real-world problems.
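PPO optimizes a clipped surrogate objective that limits how far each update can move the policy away from the policy that collected the data. The following is a minimal NumPy sketch of that objective as defined in the PPO paper (Schulman et al., 2017); the function name is illustrative, and the `epsilon=0.2` default is the value from the PPO paper, not necessarily the setting used for OpenAI Five:

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate objective from PPO (Schulman et al., 2017).

    ratio:     pi_theta(a|s) / pi_theta_old(a|s) for each sampled action
    advantage: per-sample advantage estimate (e.g. from GAE)
    epsilon:   clip range; bounds the effective policy ratio
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # PPO maximizes the elementwise minimum, which removes any incentive
    # to push the policy ratio outside [1 - epsilon, 1 + epsilon].
    return np.mean(np.minimum(unclipped, clipped))
```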
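The paper's actual surgery tooling is not reproduced here, but the core idea, initializing a changed model so that it computes the same function as the old one, can be illustrated with a toy example. Below, `widen_linear_input` is a hypothetical helper showing one function-preserving edit: when an environment change adds new observation features, the new weight columns are zero-initialized so the old behavior is retained exactly and training continues without a reset.

```python
import numpy as np

def widen_linear_input(W, b, n_new_inputs):
    """Function-preserving 'surgery' on a linear layer y = W @ x + b.

    Appends zero-initialized columns to W so the widened layer computes
    exactly the same outputs as before on the old inputs, while the new
    input features start with no influence. Illustrative sketch only,
    not OpenAI Five's actual surgery code.
    """
    old_out, _ = W.shape
    W_new = np.concatenate([W, np.zeros((old_out, n_new_inputs))], axis=1)
    return W_new, b  # bias is unchanged

# Sanity check: the widened layer agrees with the old one on old inputs.
W = np.random.randn(4, 8)
b = np.random.randn(4)
W_new, b_new = widen_linear_input(W, b, n_new_inputs=3)
x_old = np.random.randn(8)
x_new = np.concatenate([x_old, np.random.randn(3)])
assert np.allclose(W @ x_old + b, W_new @ x_new + b_new)
```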