WorldGPT: Empowering LLM as Multimodal World Model

28 Sep 2024 | Zhiqi Ge, Hongzhe Huang, Mingze Zhou, Juncheng Li, Guoming Wang, Siliang Tang, Yueting Zhuang
WorldGPT is a generalist world model built on a Multimodal Large Language Model (MLLM) and designed to predict state transitions across modalities. It is trained on millions of videos from diverse domains, enabling it to understand and predict complex scenarios. To extend these capabilities, WorldGPT integrates a novel cognitive architecture comprising memory offloading, knowledge retrieval, and context reflection, which allows it to handle specialized scenarios and long-horizon tasks more effectively.

To evaluate WorldGPT, the authors developed WorldNet, a multimodal state transition prediction benchmark consisting of two subsets: WorldNet-Wild, derived from raw internet videos, and WorldNet-Crafted, transformed from existing high-quality datasets. WorldNet provides a comprehensive corpus for training and evaluating world models, covering a wide range of real-world scenarios and tasks. WorldGPT can also synthesize multimodal instruction instances that are as reliable as authentic data for fine-tuning, enabling multimodal agents to generalize to unfamiliar domains.
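To make the cognitive architecture more concrete, the sketch below shows one way a state-transition world model could be wrapped with memory offloading, knowledge retrieval, and context reflection. All class and method names (WorldState, CognitiveArchitecture, WorldModel, predict_transition) are illustrative assumptions for exposition, not the actual WorldGPT API.

```python
# Illustrative sketch only: names are assumptions, not the WorldGPT codebase.
# It shows how a multimodal state-transition predictor might be combined with
# the three cognitive-architecture components described above.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorldState:
    """A multimodal state; any subset of modalities may be present."""
    text: Optional[str] = None
    image: Optional[bytes] = None
    audio: Optional[bytes] = None


@dataclass
class CognitiveArchitecture:
    memory: list = field(default_factory=list)            # memory offloading
    knowledge_base: dict = field(default_factory=dict)    # knowledge retrieval

    def offload(self, state: WorldState, action: str) -> None:
        # Persist past transitions so long-horizon rollouts stay consistent.
        self.memory.append((state, action))

    def retrieve(self, action: str) -> list:
        # Pull external knowledge relevant to the current action/scenario.
        return [v for k, v in self.knowledge_base.items() if k in action]

    def reflect(self, prediction: WorldState) -> WorldState:
        # Context reflection: revise the prediction against stored memory.
        # (A no-op placeholder here; the real mechanism is model-driven.)
        return prediction


class WorldModel:
    """Hypothetical wrapper around the MLLM backbone."""

    def __init__(self, backbone, cognition: CognitiveArchitecture):
        self.backbone = backbone      # any callable (state, action, context) -> state
        self.cognition = cognition

    def predict_transition(self, state: WorldState, action: str) -> WorldState:
        context = self.cognition.retrieve(action)
        next_state = self.backbone(state, action, context)   # MLLM forward pass
        next_state = self.cognition.reflect(next_state)
        self.cognition.offload(state, action)
        return next_state
```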
The project is available on GitHub at https://github.com/DCDmllm/WorldGPT.

WorldGPT is composed of three modules: multimodal encoders, a Large Language Model (LLM) integrated with the cognitive architecture, and multimodal decoders. The LLM is trained with a progressive state transition training methodology so that the model remains effective in complex situations, while the cognitive architecture lets WorldGPT retrieve external knowledge and maintain temporal consistency in its predictions. WorldGPT has been evaluated on visual understanding, embodied planning, and audio-video question answering, and the results demonstrate its proficiency in modeling world dynamics. It can also serve as a universal world simulator, generating dynamic scenes that evolve in response to complex interactions, which gives it more practical value than earlier generative models.

The paper further explores dream tuning, a learning paradigm in which multimodal agents acquire specialized knowledge from WorldGPT by fine-tuning on synthetic multimodal instruction data for specific tasks. Agents tuned this way perform competitively with agents trained on authentic data, supporting the reliability of WorldGPT as a world simulator.
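The dream-tuning loop can be summarized roughly as follows: roll the world model forward from seed states to synthesize instruction instances, then fine-tune an agent on them. This is a minimal sketch under assumed names (synthesize_instruction_data, dream_tune, agent.fine_tune) and an assumed instruction format; it is not the paper's exact implementation.

```python
# Illustrative sketch of dream tuning: function names and the instruction
# schema are assumptions for exposition, not the authors' implementation.

def synthesize_instruction_data(world_model, seed_states, actions, steps=3):
    """Roll the world model forward to produce (instruction, input, output) instances."""
    instances = []
    for state in seed_states:
        for action in actions:
            current = state
            for _ in range(steps):
                nxt = world_model.predict_transition(current, action)
                instances.append({
                    "instruction": "Given the current scene, predict what "
                                   f"happens after the action: {action}",
                    "input": current,    # multimodal state (text/image/audio)
                    "output": nxt,       # predicted next state
                })
                current = nxt
    return instances


def dream_tune(agent, world_model, seed_states, actions):
    """Fine-tune a multimodal agent on world-model-synthesized instruction data."""
    synthetic = synthesize_instruction_data(world_model, seed_states, actions)
    agent.fine_tune(synthetic)   # hypothetical fine-tuning entry point on the agent
    return agent
```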