7 Sep 2017 | Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, Ilya Sutskever
Evolution Strategies (ES) is presented as a scalable alternative to reinforcement learning (RL) methods such as Q-learning and policy gradients. ES is a black-box optimization algorithm that optimizes policy parameters directly from whole-episode returns, so it copes well with sparse rewards and long horizons and needs neither temporal discounting nor value-function approximation.

Experiments on MuJoCo and Atari show that ES scales efficiently with the number of parallel workers: 3D humanoid walking is solved in about 10 minutes, and competitive performance on most Atari games is reached after one hour of training. ES is highly parallelizable because workers share random seeds and can reconstruct each other's parameter perturbations locally, so they only need to exchange scalar returns; this enables near-linear speedups even with over 1,000 workers. Its data efficiency is reasonable for a black-box method, although on Atari it typically needs several times more data than A3C to reach comparable scores, and it compensates with much lower compute per timestep since no backpropagation or value function is required.

ES also exhibits better exploration behavior than policy gradient methods, learning diverse gaits on MuJoCo tasks, and it is robust, using fixed hyperparameters across environments. Because it optimizes episode returns directly, ES is invariant to action frequency and delayed rewards and handles long time horizons without value-function approximation. It remains effective in high-dimensional parameter spaces and on complex control problems, is well suited to low-precision hardware, and can incorporate non-differentiable components. Overall, the experiments indicate that ES is competitive with other RL methods while offering large improvements in wall-clock training speed, making it a viable and scalable approach for complex RL tasks.
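To make the algorithm concrete, the sketch below follows the ES loop as summarized above: perturb the parameter vector with Gaussian noise, run a full episode for each perturbation, and move the parameters toward the perturbations that scored well. This is a minimal single-process illustration, not the authors' implementation; the function names and hyperparameter values are illustrative, while mirrored (antithetic) sampling and rank-based fitness shaping are variance-reduction tricks the paper mentions.

```python
import numpy as np

def evolution_strategies(fitness, theta0, npop=50, sigma=0.1, alpha=0.01,
                         iters=300, seed=0):
    """Minimal single-process ES sketch.

    fitness: maps a parameter vector to a scalar episode return.
    theta0:  initial policy parameters (1-D numpy array).
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(iters):
        # Gaussian perturbations with mirrored (antithetic) sampling.
        eps = rng.standard_normal((npop, theta.size))
        eps = np.concatenate([eps, -eps])
        # One full-episode rollout per perturbed policy. In the distributed
        # setting each worker evaluates a slice of these and reports only
        # the scalar returns it measured.
        returns = np.array([fitness(theta + sigma * e) for e in eps])
        # Rank-based fitness shaping: reduces sensitivity to return outliers.
        ranks = returns.argsort().argsort()
        shaped = ranks / (len(returns) - 1) - 0.5
        # Stochastic estimate of the gradient of E[F(theta + sigma * eps)].
        grad = shaped @ eps / (len(eps) * sigma)
        theta += alpha * grad
    return theta

# Toy usage: maximize a concave quadratic whose optimum is the all-ones vector;
# theta should end up close to that optimum.
theta = evolution_strategies(lambda w: -np.sum((w - 1.0) ** 2),
                             theta0=np.zeros(10))
```

The shared-seed trick behind the "workers only exchange scalar returns" point can be sketched the same way: if every worker holds the same parameter vector and the same list of seeds, it can regenerate every perturbation locally and apply an identical update once the scalar returns have been broadcast. The helper below is an assumed layout of that step, again not the authors' code.

```python
def apply_broadcast_returns(theta, seeds, returns, sigma=0.1, alpha=0.01):
    """Rebuild all perturbations from shared seeds and apply the common update."""
    returns = np.asarray(returns, dtype=float)
    ranks = returns.argsort().argsort()
    shaped = ranks / (len(returns) - 1) - 0.5
    grad = np.zeros_like(theta)
    for s, u in zip(seeds, shaped):
        # Each perturbation is reproduced from its seed; no vectors are sent.
        eps = np.random.default_rng(s).standard_normal(theta.size)
        grad += u * eps
    return theta + alpha * grad / (len(seeds) * sigma)
```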