WorldGPT: Empowering LLM as Multimodal World Model

28 Sep 2024 | Zhiqi Ge, Hongzhe Huang, Mingze Zhou, Juncheng Li, Guoming Wang, Siliang Tang, Yueting Zhuang
WorldGPT is a generalist world model designed to understand and predict state transitions across modalities. It is built on a Multimodal Large Language Model (MLLM) and trained on millions of videos from diverse domains. To strengthen its performance in specialized scenarios and long-term tasks, WorldGPT adds a cognitive architecture that combines memory offloading, knowledge retrieval, and context reflection. These are implemented as a working memory mechanism, a knowledge retrieval system, and a ContextReflector that extracts relevant information from retrieved contexts. WorldGPT is evaluated on WorldNet, a comprehensive dataset for multimodal state transition prediction that includes both raw internet videos (WorldNet-Wild) and high-quality, curated samples (WorldNet-Crafted). Experiments demonstrate WorldGPT's proficiency in modeling world dynamics and its effectiveness as a universal world simulator, capable of synthesizing dynamic scenes and transferring specialized knowledge to downstream agents through dream tuning. The project is available on GitHub.
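To make the offload, retrieve, and reflect loop of the cognitive architecture more concrete, here is a minimal toy sketch in Python. It is not WorldGPT's implementation: the names `Transition`, `ToyMemory`, and `reflect` are hypothetical, states are plain lists of floats rather than multimodal embeddings, and a similarity-weighted average stands in for the ContextReflector, whose actual design is not described in this summary.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Transition:
    """One stored (state, action, next_state) record; vectors are toy embeddings."""
    state: list
    action: str
    next_state: list


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ToyMemory:
    """Hypothetical external memory: stores past transitions and retrieves similar ones."""
    records: list = field(default_factory=list)

    def offload(self, transition):
        # Memory offloading: move a past transition out of the model's context.
        self.records.append(transition)

    def retrieve(self, state, action, k=2):
        # Knowledge retrieval: top-k stored transitions with the same action,
        # ranked by similarity of their starting state to the query state.
        candidates = [r for r in self.records if r.action == action]
        candidates.sort(key=lambda r: cosine(r.state, state), reverse=True)
        return candidates[:k]


def reflect(state, contexts):
    """Assumed stand-in for context reflection: condense retrieved transitions
    into one hint vector, weighting each by similarity to the query state."""
    weights = [max(cosine(c.state, state), 0.0) for c in contexts]
    total = sum(weights)
    if total == 0:
        return None  # nothing relevant was retrieved
    dim = len(contexts[0].next_state)
    return [
        sum(w * c.next_state[i] for w, c in zip(weights, contexts)) / total
        for i in range(dim)
    ]


memory = ToyMemory()
memory.offload(Transition([1.0, 0.0], "push", [2.0, 0.0]))
memory.offload(Transition([0.0, 1.0], "push", [0.0, 2.0]))
memory.offload(Transition([1.0, 0.0], "pull", [0.0, 0.0]))

query = [1.0, 0.0]
hint = reflect(query, memory.retrieve(query, "push"))
print(hint)
```

In this sketch the hint would be passed to the predictor as extra context alongside the current state and action. In the real system that role is played by the MLLM, which the toy code does not model.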