World Models

9 May 2018 | David Ha, Jürgen Schmidhuber
This paper introduces a world model for reinforcement learning (RL) that enables an agent to learn a compressed spatial and temporal representation of its environment. The world model is trained in an unsupervised manner to predict future states from current observations. By using features extracted from the world model as inputs to the agent, a compact and simple policy can be trained to solve tasks. The agent can even be trained entirely inside the hallucinated "dream" generated by its own world model, and the resulting policy can be transferred back into the actual environment.

The agent is composed of three components: a visual sensory component (V) that compresses each observation into a latent code, a memory component (M) that predicts future codes from historical information, and a decision-making component (C) that selects actions based on these compressed representations. The visual component, a variational autoencoder (VAE), compresses each input frame into a latent vector z_t. The memory component, a mixture-density-network RNN (MDN-RNN), predicts a distribution over the next latent vector given the past latent vectors and actions. The controller (C) is a single linear layer that maps the current latent vector z_t and the RNN hidden state h_t to an action.
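The controller is the only component trained with the reward signal, and it is kept deliberately small. Below is a minimal NumPy sketch of such a linear controller, assuming the sizes the paper uses for Car Racing (a 32-dimensional z_t, a 256-dimensional h_t, and a 3-dimensional action); the class name and weight initialization are illustrative choices, not the authors' code.

```python
import numpy as np

# Sizes assumed from the paper's Car Racing setup: a 32-dim VAE latent z_t,
# a 256-dim MDN-RNN hidden state h_t, and a 3-dim action (steer, gas, brake).
Z_DIM, H_DIM, A_DIM = 32, 256, 3

class LinearController:
    """Single linear layer a_t = W_c [z_t ; h_t] + b_c, the controller form used in the paper."""

    def __init__(self, rng: np.random.Generator):
        self.W = 0.1 * rng.standard_normal((A_DIM, Z_DIM + H_DIM))  # W_c
        self.b = np.zeros(A_DIM)                                    # b_c

    def action(self, z: np.ndarray, h: np.ndarray) -> np.ndarray:
        # Concatenate the compressed observation with the RNN memory and map
        # it linearly to an action; environment-specific action bounding is omitted.
        return self.W @ np.concatenate([z, h]) + self.b

    def num_params(self) -> int:
        # (32 + 256) * 3 + 3 = 867 parameters -- small enough to optimize with
        # a black-box method such as CMA-ES instead of backpropagation.
        return self.W.size + self.b.size

rng = np.random.default_rng(0)
controller = LinearController(rng)
z, h = rng.standard_normal(Z_DIM), rng.standard_normal(H_DIM)
print(controller.num_params(), controller.action(z, h))
```

Because C has only a few hundred parameters, the heavy lifting of perception and prediction is left to V and M, which are trained without any reward signal.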
In the Car Racing experiment, the agent was trained to navigate randomly generated tracks. The world model (V and M) was trained on 10,000 random rollouts, and the controller was trained to maximize cumulative reward. The agent achieved a score of 906 ± 21 over 100 trials, solving the task and outperforming previous methods; the trained agent can also drive inside the dream environment generated by its own world model.

In the VizDoom (Take Cover) experiment, the agent was trained to avoid fireballs entirely inside the virtual environment generated by the world model. It survived roughly 900 time steps per episode in that dream environment and, once the policy was transferred to the actual environment, roughly 1,100 time steps. The agent also discovered an adversarial policy that exploits imperfections of the world model to suppress fireballs inside the dream, but this policy failed in the actual environment.

The paper discusses the limitations of the approach, including the world model's tendency to generate trajectories that do not follow the actual environment's rules, and the use of evolution strategies (CMA-ES) for training the controller, which allows efficient exploration of its small parameter space; a sketch of this training loop follows.
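As a concrete illustration of that training loop, here is a minimal sketch using the open-source `cma` package, which implements the CMA-ES optimizer used in the paper. The toy `evaluate` function is a placeholder for the real fitness, which in the paper is the average cumulative reward of a controller over several rollouts (in the actual environment or inside the dream); the population size and iteration cap are arbitrary choices for the sketch.

```python
import numpy as np
import cma  # pip install cma; provides the CMA-ES optimizer used in the paper

NUM_PARAMS = (32 + 256) * 3 + 3  # flattened W_c and b_c from the controller sketch above

def evaluate(params: np.ndarray) -> float:
    """Placeholder fitness: stands in for mean cumulative reward over rollouts."""
    return -float(np.sum(params ** 2))  # toy objective; higher is better

es = cma.CMAEvolutionStrategy(np.zeros(NUM_PARAMS), 0.5, {"popsize": 16, "maxiter": 20})
while not es.stop():
    candidates = es.ask()                        # sample a population of parameter vectors
    losses = [-evaluate(x) for x in candidates]  # CMA-ES minimizes, so negate the fitness
    es.tell(candidates, losses)                  # update the search distribution
print("best fitness found:", evaluate(es.result.xbest))
```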
The paper concludes that the world model approach offers practical benefits for training agents in complex environments, and that further research is needed to increase the model's capacity and to handle more complex tasks.