Genie: Generative Interactive Environments

2024-02-26 | Jake Bruce*,1, Michael Dennis*,1, Ashley Edwards*,1, Jack Parker-Holder*,1, Yuge (Jimmy) Shi*,1, Edward Hughes1, Matthew Lai1, Aditi Mavalankar1, Richie Steigerwald1, Chris Apps1, Yusuf Aytar1, Sarah Bechtle1, Feryal Behbahani1, Stephanie Chan1, Nicolas Heess1, Lucy Gonzalez1, Simon Osindero1, Sherjil Ozair1, Scott Reed1, Jingwei Zhang1, Konrad Zolna1, Jeff Clune1,2, Nando de Freitas1, Satinder Singh1 and Tim Rocktäschel*,1
Genie is a novel generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. It can generate a wide variety of interactive, playable environments based on text, synthetic images, photographs, and sketches. At 11 billion parameters, Genie serves as a foundation world model, capable of generating and controlling virtual worlds through latent actions. The model consists of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a latent action model. Despite training without ground-truth action labels, Genie enables users to act in generated environments on a frame-by-frame basis. The learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening new avenues for training generalist agents. Genie demonstrates high video fidelity and controllability, and can generate diverse trajectories in unseen reinforcement learning environments. The model's generality and controllability make it a promising tool for future research in interactive environments and agent training.
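The three-component design described above can be pictured with a minimal Python sketch. All class names, token shapes, and placeholder logic here are illustrative assumptions rather than the paper's actual architecture; only the overall data flow follows the abstract: the tokenizer maps frames to discrete tokens, the latent action model infers a small discrete action between consecutive frames during training, and at inference the user picks a latent action each step while the dynamics model autoregressively predicts the next frame's tokens.

```python
import numpy as np

# Illustrative sketch of Genie's three components; not the paper's API.
NUM_LATENT_ACTIONS = 8  # the paper uses a small discrete latent action vocabulary


class VideoTokenizer:
    """Stand-in for the spatiotemporal video tokenizer: frames -> discrete tokens."""

    def encode(self, frame):
        # Placeholder: quantize a few pixel values into a small token grid.
        return (frame.flatten()[:16] * 255).astype(int) % 1024


class LatentActionModel:
    """Stand-in for the latent action model: at training time it infers which
    discrete latent action best explains a transition between frames."""

    def infer(self, prev_tokens, next_tokens):
        # Placeholder: attribute the transition to one of the discrete actions.
        return int(np.sum(next_tokens - prev_tokens)) % NUM_LATENT_ACTIONS


class DynamicsModel:
    """Stand-in for the autoregressive dynamics model: predicts next-frame tokens
    conditioned on the token history and a latent action."""

    def predict(self, token_history, latent_action):
        # Placeholder: perturb the most recent token grid by the chosen action.
        return (token_history[-1] + latent_action) % 1024


def interactive_rollout(tokenizer, dynamics, first_frame, user_actions):
    """Frame-by-frame generation from a single prompt frame, as in the abstract:
    the user supplies a discrete latent action at every step."""
    tokens = [tokenizer.encode(first_frame)]
    for action in user_actions:
        tokens.append(dynamics.predict(tokens, action))
    return tokens


frame0 = np.random.rand(64, 64, 3)  # prompt image (photo, sketch, etc.)
rollout = interactive_rollout(
    VideoTokenizer(), DynamicsModel(), frame0, user_actions=[1, 3, 0]
)
print(len(rollout), "frames of tokens generated")
```

Because no ground-truth action labels exist, the latent action vocabulary is learned jointly with the dynamics model; the user-facing controls at inference are exactly these learned discrete actions.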