Diffusion for World Modeling: Visual Details Matter in Atari

2024 | Eloi Alonso*, Adam Jelley*, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce†, François Fleuret†
This paper introduces DIAMOND, a reinforcement learning agent trained entirely inside a diffusion world model. The authors argue that conventional world models, which compress observations into discrete latent variables, can discard small visual details that matter for effective control. Diffusion models, by contrast, are widely used for image generation precisely because they capture fine-grained visual detail, making them a natural fit for world modeling. DIAMOND's diffusion world model generates realistic future observations, and the agent is trained on these imagined trajectories rather than on real environment interaction.

On the Atari 100k benchmark, DIAMOND achieves a mean human normalized score of 1.46, a new state of the art for agents trained entirely within a world model. The same approach is also shown to yield an interactive neural game engine when trained on static Counter-Strike: Global Offensive gameplay.

The authors analyze the key design choices that make diffusion suitable for world modeling and demonstrate that sharper visual detail translates into better agent performance. DIAMOND performs particularly well in environments where capturing small details is important, such as Asterix, Breakout, and Road Runner. They also discuss limitations of the approach, including the need for further research on integrating reward and termination prediction into the diffusion model, and the potential for improvements in scalability and memory efficiency. Overall, the study highlights the promise of diffusion models for world modeling and shows that better visual detail leads to stronger agents.
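The "mean human normalized score of 1.46" follows the standard Atari normalization: an agent's raw score is rescaled so that a random policy maps to 0 and the human reference maps to 1. A minimal sketch (the scores below are illustrative placeholders, not the paper's numbers):

```python
def human_normalized_score(agent: float, random_: float, human: float) -> float:
    """HNS = (agent - random) / (human - random).

    0.0 means random-level play; 1.0 means the human reference;
    values above 1.0 mean the agent beats the human baseline.
    """
    return (agent - random_) / (human - random_)


# Toy numbers for illustration only.
hns = human_normalized_score(agent=400.0, random_=100.0, human=300.0)
print(hns)  # 1.5
```

The benchmark's headline figure (1.46) is the mean of this quantity over the 26 Atari 100k games.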
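The training-in-imagination loop described above can be sketched as follows. This is a toy illustration, not DIAMOND's implementation: `denoise_step` is a hypothetical stand-in for the learned denoiser (which in the paper is conditioned on past frames and actions), and the reward is a placeholder for a learned reward head.

```python
import numpy as np

rng = np.random.default_rng(0)


def denoise_step(noisy_obs, cond_obs, sigma):
    # Hypothetical stand-in for a learned denoiser: a real model would
    # predict the clean next frame from a noisy draft, conditioned on
    # recent frames and the agent's action. Here: a toy contraction
    # pulling the noisy draft toward the conditioning frame.
    return noisy_obs - sigma * (noisy_obs - cond_obs)


def imagine_next_obs(obs, action, n_steps=3):
    # Few-step diffusion sampling: start from pure noise and
    # iteratively denoise toward a plausible next observation.
    x = rng.normal(size=obs.shape)
    for sigma in np.linspace(1.0, 0.9, n_steps):
        x = denoise_step(x, obs, sigma)
    return x


def imagined_rollout(policy, obs, horizon=5):
    # The agent is trained purely on model-generated trajectories:
    # the real environment is never queried during policy updates.
    traj = []
    for _ in range(horizon):
        action = policy(obs)
        next_obs = imagine_next_obs(obs, action)
        reward = float(next_obs.mean())  # placeholder reward head
        traj.append((obs, action, reward))
        obs = next_obs
    return traj


traj = imagined_rollout(lambda o: 0, np.zeros((4, 4)), horizon=5)
print(len(traj))  # 5 imagined transitions
```

In the paper, rollouts like these feed an actor-critic learner, while reward and episode termination are predicted by separate model components (a point the authors list as a direction for tighter integration).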