RL^2: FAST REINFORCEMENT LEARNING VIA SLOW REINFORCEMENT LEARNING

Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, Pieter Abbeel
The paper "RL²: Fast Reinforcement Learning via Slow Reinforcement Learning" by Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel proposes an approach to reinforcement learning (RL) that aims to close the gap between the fast, few-trial learning of humans and animals and the high sample complexity of current deep RL algorithms. The authors introduce RL², in which the fast RL algorithm is encoded in the weights of a recurrent neural network (RNN), and those weights are learned slowly by a general-purpose ("slow") RL algorithm.

The RNN receives all the information an RL algorithm would typically use: observations, actions, rewards, and termination flags. Crucially, it retains its hidden state across episodes on the same Markov decision process (MDP), so the RNN's activations store the state of the "fast" RL algorithm for the current MDP. The method is an end-to-end optimization in which the agent acts as both the learning algorithm and the policy.

For the outer loop, the authors optimize the recurrent policy with a first-order implementation of Trust Region Policy Optimization (TRPO), use a baseline that is itself represented as an RNN to reduce the variance of stochastic gradient estimates, and apply Generalized Advantage Estimation (GAE) to reduce variance further.
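To make this agent interface concrete, below is a minimal sketch in PyTorch of an RL²-style recurrent policy and a multi-episode trial rollout. It is not the authors' implementation: the names `RL2Policy` and `run_trial` are hypothetical, discrete actions and an old-style Gym environment with `reset()`/`step()` are assumed, and the outer-loop TRPO/GAE optimization is omitted.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class RL2Policy(nn.Module):
    """Illustrative RL^2-style recurrent policy (not the authors' code)."""

    def __init__(self, obs_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.n_actions = n_actions
        # Input: (observation, previous action one-hot, previous reward, termination flag).
        self.rnn = nn.GRUCell(obs_dim + n_actions + 2, hidden_dim)
        self.policy_head = nn.Linear(hidden_dim, n_actions)

    def act(self, obs, prev_action, prev_reward, prev_done, hidden):
        a_onehot = torch.nn.functional.one_hot(prev_action, self.n_actions).float()
        x = torch.cat([obs, a_onehot, prev_reward, prev_done], dim=-1)
        hidden = self.rnn(x, hidden)  # hidden state = state of the "fast" learner
        dist = Categorical(logits=self.policy_head(hidden))
        action = dist.sample()
        return action, dist.log_prob(action), hidden


def run_trial(env, policy, episodes_per_trial=2, hidden_dim=128):
    """Run several episodes on the same MDP without resetting the hidden state,
    so later episodes can exploit what was learned in earlier ones."""
    hidden = torch.zeros(1, hidden_dim)
    prev_action = torch.zeros(1, dtype=torch.long)
    prev_reward = torch.zeros(1, 1)
    prev_done = torch.zeros(1, 1)
    total_return = 0.0
    for _ in range(episodes_per_trial):
        obs, done = env.reset(), False
        while not done:
            obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            action, _, hidden = policy.act(obs_t, prev_action, prev_reward, prev_done, hidden)
            obs, reward, done, _ = env.step(action.item())  # old Gym 4-tuple API assumed
            total_return += reward
            prev_action = action
            prev_reward = torch.tensor([[float(reward)]])
            prev_done = torch.tensor([[float(done)]])
    return total_return
```

The key design point is that the hidden state is only reset at trial boundaries, not episode boundaries, which is what lets the recurrent policy behave like a learning algorithm within a trial.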
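Since GAE is the main variance-reduction tool mentioned, the following short sketch shows the standard GAE(γ, λ) recursion over a trajectory's rewards and value estimates. This is an illustrative assumption about the standard formulation, not code from the paper, and it omits terminal-state masking for brevity.

```python
import numpy as np


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE(gamma, lambda) recursion (illustrative sketch).
    `values` must hold one extra bootstrap value for the state after the last step."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD(0) residual
        gae = delta + gamma * lam * gae                          # discounted sum of residuals
        advantages[t] = gae
    return advantages
```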
The paper evaluates RL² on both small-scale and large-scale problems. On small-scale tasks such as multi-armed bandits and finite MDPs, RL² achieves performance comparable to human-designed algorithms with optimality guarantees; on a large-scale, vision-based navigation task, it demonstrates scalability to high-dimensional problems. The authors identify opportunities for improvement, such as better outer-loop RL algorithms and policy architectures, and suggest that exploiting problem structure could significantly boost performance.