Genie: Generative Interactive Environments

2024-02-26 | Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge (Jimmy) Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder Singh and Tim Rocktäschel
Genie is a generative interactive environment trained from unlabelled Internet videos: an 11B-parameter foundation world model that can turn a text prompt, image, or sketch into an interactive, playable virtual world. Its latent action interface is learned fully unsupervised, enabling frame-by-frame control without any ground-truth action labels.

The model consists of three components built on spatiotemporal transformers: a spatiotemporal video tokenizer, a latent action model, and an autoregressive dynamics model. Trained on a large-scale dataset of Internet videos of 2D platformer games, drawn from over 200,000 hours of Internet gaming footage, Genie generates high-quality, controllable videos across diverse domains, and its quality scales gracefully with increasing computational resources.

Beyond generation, the learned latent actions transfer: Genie can infer policies from unseen, action-free videos of simulated environments, with promising results for imitation learning and for training agents in unseen environments, and potential applications in robotics and other domains. The paper discusses the methodology, experimental results, and broader impact of Genie, highlighting directions for future research and applications.
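A minimal sketch of the resulting inference loop may help make the three-component design concrete. Everything below is illustrative: the class names, method signatures, and dummy internals are assumptions, not the paper's code. Only the overall structure (tokenize a prompt frame, then alternate user-chosen latent actions with dynamics predictions) and the small discrete action codebook of 8 latent actions follow the paper.

```python
"""Minimal sketch of Genie-style frame-by-frame generation.
All classes are illustrative stand-ins with dummy internals."""

from dataclasses import dataclass
import numpy as np

NUM_LATENT_ACTIONS = 8  # Genie uses a small discrete codebook of latent actions


@dataclass
class VideoTokenizer:
    """Stand-in for the spatiotemporal video tokenizer (VQ-based in the paper)."""

    def encode(self, frame: np.ndarray) -> np.ndarray:
        return frame  # dummy: the real tokenizer maps frames to discrete tokens

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        return tokens  # dummy: the real tokenizer reconstructs pixels from tokens


@dataclass
class DynamicsModel:
    """Stand-in for the autoregressive dynamics model."""

    def predict(self, token_history: list, action: int) -> np.ndarray:
        # dummy: the real model predicts next-frame tokens conditioned on
        # all previous frame tokens and the chosen latent action
        return token_history[-1] + action


def play(tokenizer: VideoTokenizer,
         dynamics: DynamicsModel,
         prompt_frame: np.ndarray,
         actions: list) -> list:
    """Generate one new frame per latent action, starting from a prompt image."""
    tokens = [tokenizer.encode(prompt_frame)]
    frames = [prompt_frame]
    for a in actions:
        assert 0 <= a < NUM_LATENT_ACTIONS
        tokens.append(dynamics.predict(tokens, action=a))
        frames.append(tokenizer.decode(tokens[-1]))
    return frames


if __name__ == "__main__":
    frame0 = np.zeros((64, 64))  # a prompt image (or a sketch)
    frames = play(VideoTokenizer(), DynamicsModel(), frame0, actions=[0, 3, 3, 1])
    print(len(frames))  # 5: the prompt plus one frame per action
```

Note that the latent action model does not appear in this loop: it is needed at training time, where it infers which discrete action best explains the transition between consecutive unlabelled frames; at inference, as above, the user supplies the action indices directly.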