7 Sep 2017 | Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, Ilya Sutskever
Evolution Strategies (ES) is presented as a scalable alternative to reinforcement learning (RL) methods such as Q-learning and policy gradients. ES is a black-box optimization algorithm that optimizes policy parameters directly from whole-episode returns, so it copes well with sparse rewards and long horizons and needs neither temporal discounting nor value-function approximation.

Experiments on MuJoCo and Atari show that ES scales efficiently with the number of parallel workers: 3D humanoid walking is solved in about 10 minutes, and competitive performance on most Atari games is reached after one hour of training. ES is highly parallelizable because workers share random seeds and can reconstruct each other's parameter perturbations locally, so they only need to exchange scalar returns; this enables near-linear speedups even with over 1,000 workers. Its data efficiency is reasonable for a black-box method, although on Atari it typically needs several times more data than A3C to reach comparable scores, and it compensates with much lower compute per timestep since no backpropagation or value function is required.

ES also exhibits better exploration behavior than policy gradient methods, learning diverse gaits on MuJoCo tasks, and it is robust, using fixed hyperparameters across environments. Because it optimizes episode returns directly, ES is invariant to action frequency and delayed rewards and handles long time horizons without value-function approximation. It remains effective in high-dimensional parameter spaces and on complex control problems, is well suited to low-precision hardware, and can incorporate non-differentiable components. Overall, the experiments indicate that ES is competitive with other RL methods while offering large improvements in wall-clock training speed, making it a viable and scalable approach for complex RL tasks.
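To make the algorithm concrete, the sketch below follows the ES loop as summarized above: perturb the parameter vector with Gaussian noise, run a full episode for each perturbation, and move the parameters toward the perturbations that scored well. This is a minimal single-process illustration, not the authors' implementation; the function names and hyperparameter values are illustrative, while mirrored (antithetic) sampling and rank-based fitness shaping are variance-reduction tricks the paper mentions.

```python
import numpy as np

def evolution_strategies(fitness, theta0, npop=50, sigma=0.1, alpha=0.01,
                         iters=300, seed=0):
    """Minimal single-process ES sketch.

    fitness: maps a parameter vector to a scalar episode return.
    theta0:  initial policy parameters (1-D numpy array).
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(iters):
        # Gaussian perturbations with mirrored (antithetic) sampling.
        eps = rng.standard_normal((npop, theta.size))
        eps = np.concatenate([eps, -eps])
        # One full-episode rollout per perturbed policy. In the distributed
        # setting each worker evaluates a slice of these and reports only
        # the scalar returns it measured.
        returns = np.array([fitness(theta + sigma * e) for e in eps])
        # Rank-based fitness shaping: reduces sensitivity to return outliers.
        ranks = returns.argsort().argsort()
        shaped = ranks / (len(returns) - 1) - 0.5
        # Stochastic estimate of the gradient of E[F(theta + sigma * eps)].
        grad = shaped @ eps / (len(eps) * sigma)
        theta += alpha * grad
    return theta

# Toy usage: maximize a concave quadratic whose optimum is the all-ones vector;
# theta should end up close to that optimum.
theta = evolution_strategies(lambda w: -np.sum((w - 1.0) ** 2),
                             theta0=np.zeros(10))
```

The shared-seed trick behind the "workers only exchange scalar returns" point can be sketched the same way: if every worker holds the same parameter vector and the same list of seeds, it can regenerate every perturbation locally and apply an identical update once the scalar returns have been broadcast. The helper below is an assumed layout of that step, again not the authors' code.

```python
def apply_broadcast_returns(theta, seeds, returns, sigma=0.1, alpha=0.01):
    """Rebuild all perturbations from shared seeds and apply the common update."""
    returns = np.asarray(returns, dtype=float)
    ranks = returns.argsort().argsort()
    shaped = ranks / (len(returns) - 1) - 0.5
    grad = np.zeros_like(theta)
    for s, u in zip(seeds, shaped):
        # Each perturbation is reproduced from its seed; no vectors are sent.
        eps = np.random.default_rng(s).standard_normal(theta.size)
        grad += u * eps
    return theta + alpha * grad / (len(seeds) * sigma)
```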