This paper presents a method for training agents in reinforcement learning (RL) environments using recurrent world models and evolution strategies. A generative recurrent neural network (RNN) is first trained in an unsupervised manner to model the environment, extracting compressed spatiotemporal features that are then used to train a simple policy via evolution. The agent can even be trained entirely within an environment generated by its own world model, with the learned policy then transferred back to the actual environment. Throughout, the world model's internal representations provide the state features from which a controller selects actions.
The agent consists of three components: a visual sensory component (V) that compresses each visual frame into a small latent vector, a memory component (M) that predicts future latent states from past observations and actions, and a controller (C) that selects actions from the latent vector together with M's hidden state. The controller is deliberately kept small and simple, and is trained with an evolution strategy, Covariance Matrix Adaptation Evolution Strategy (CMA-ES), to maximize the expected cumulative reward.
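To make the data flow concrete, here is a minimal sketch of the V-M-C decision step and the evolutionary training loop. It is not the authors' code: the dimensions (a 32-dimensional latent from V, a 256-dimensional hidden state from M, a 3-dimensional action), the names controller_action and rollout, the use of the pycma package for CMA-ES, and the stubbed random dynamics standing in for V, M, and the environment are all illustrative assumptions.

```python
# Minimal sketch of the V-M-C agent and its CMA-ES training loop.
# Dimensions, helper names, and the stubbed dynamics are assumptions,
# not a reproduction of the paper's implementation.
import numpy as np
import cma  # pip install cma

Z_DIM, H_DIM, A_DIM = 32, 256, 3               # latent, RNN hidden, action sizes
PARAM_DIM = (Z_DIM + H_DIM) * A_DIM + A_DIM    # weights + bias of the linear controller


def controller_action(params, z, h):
    """C: a single linear layer over the concatenated [z, h] features."""
    W = params[: (Z_DIM + H_DIM) * A_DIM].reshape(A_DIM, Z_DIM + H_DIM)
    b = params[(Z_DIM + H_DIM) * A_DIM:]
    return np.tanh(W @ np.concatenate([z, h]) + b)


def rollout(params, steps=200, seed=0):
    """Stub rollout: random features stand in for V's latent vector, M's
    hidden state, and the environment's reward, so the sketch runs as-is."""
    rng = np.random.default_rng(seed)
    z, h, total_reward = rng.normal(size=Z_DIM), np.zeros(H_DIM), 0.0
    for _ in range(steps):
        a = controller_action(params, z, h)
        # Placeholder dynamics and reward, standing in for env.step(a).
        z = rng.normal(size=Z_DIM)
        h = np.tanh(0.9 * h + rng.normal(scale=0.1, size=H_DIM))
        total_reward += float(a @ rng.normal(size=A_DIM))
    return total_reward


# Evolve only the controller's parameters with CMA-ES.
# pycma minimizes, so the cumulative reward is negated.
es = cma.CMAEvolutionStrategy(np.zeros(PARAM_DIM), 0.1, {"popsize": 16})
for generation in range(10):
    candidates = es.ask()
    fitnesses = [-rollout(np.asarray(p)) for p in candidates]
    es.tell(candidates, fitnesses)
best_params = es.result.xbest
```

The point of the design is visible in the parameter count: because V and M carry the representational burden, only the few hundred parameters of the linear controller need to be evolved, which is what makes a derivative-free method like CMA-ES practical.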
The method is tested on a car racing task (CarRacing-v0) and a VizDoom-based video game environment (DoomTakeCover-v0). On the car racing task, the agent achieves a score of 906 ± 21, above the 900-point average required to solve the task and better than previously reported methods. On DoomTakeCover-v0, a policy trained entirely inside the world model's generated environment achieves 1092 ± 556 when transferred to the actual environment, above the 750-point threshold for solving the task, showing that a policy trained in a generated environment can transfer successfully to the actual one.
The paper also discusses the limitations of the approach, in particular the agent's tendency to exploit imperfections in the generated environment, finding policies that score well inside the dream but fail in reality. To address this, a temperature parameter of the world model is adjusted to control the amount of uncertainty in the generated environment: increasing the temperature makes the generated environment noisier and more difficult, which discourages the agent from exploiting the model's imperfections and improves transfer to the actual environment.
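The temperature enters when sampling the next latent state from the memory model's mixture-of-Gaussians output. The sketch below shows one common way such a temperature τ is applied; the 5-component mixture, the 32-dimensional latent, the function name sample_next_z, and the exact scaling (logits divided by τ, Gaussian noise scaled by √τ) are assumptions for illustration rather than the paper's exact code.

```python
# Sketch of temperature-controlled sampling from an MDN-RNN output.
# Mixture size, latent size, and the precise use of tau are illustrative
# assumptions, not a reproduction of the paper's implementation.
import numpy as np

Z_DIM, N_MIX = 32, 5  # latent size and number of mixture components


def sample_next_z(logit_pi, mu, log_sigma, tau=1.0, rng=None):
    """Sample z_{t+1} from a per-dimension Gaussian mixture.

    logit_pi, mu, log_sigma: arrays of shape (Z_DIM, N_MIX) produced by M.
    tau > 1 flattens the mixture weights and widens the Gaussians, making
    the generated environment more stochastic and harder to exploit.
    """
    rng = rng or np.random.default_rng()
    # Flatten the mixture weights by dividing the logits by tau (softmax).
    scaled = logit_pi / tau
    pi = np.exp(scaled - scaled.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    # Pick one component per latent dimension, then add temperature-scaled noise.
    z_next = np.empty(Z_DIM)
    for d in range(Z_DIM):
        k = rng.choice(N_MIX, p=pi[d])
        z_next[d] = mu[d, k] + np.exp(log_sigma[d, k]) * np.sqrt(tau) * rng.normal()
    return z_next


# Example call with random mixture parameters standing in for M's output;
# a value of tau slightly above 1 yields a noisier generated environment.
params = [np.random.randn(Z_DIM, N_MIX) for _ in range(3)]
z1 = sample_next_z(*params, tau=1.15)
```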
The paper concludes that the proposed approach offers a promising method for training agents in RL environments, with the potential to improve performance and reduce the need for extensive training in the actual environment. The method is also applicable to a wide range of tasks, including those involving high-dimensional visual data and complex environments.