iVideoGPT: Interactive VideoGPTs are Scalable World Models

2024 | Jialong Wu; Shaofeng Yin; Ningya Feng; Xu He; Dong Li; Jianye Hao; Mingsheng Long
iVideoGPT is a scalable autoregressive transformer framework that integrates multimodal signals (visual observations, actions, and rewards) into a single sequence of tokens, enabling interactive agent experience through next-token prediction. Pre-trained on millions of human and robotic manipulation trajectories, it provides a single, versatile foundation of interactive world models that can be adapted to a wide range of downstream tasks, including action-conditioned video prediction, visual planning, and model-based reinforcement learning, bridging the gap between generative video models and practical model-based reinforcement learning applications. The pre-trained models are publicly available to encourage further research.

The architecture consists of two components: a compressive tokenizer that discretizes video frames and an autoregressive transformer that predicts subsequent tokens. The tokenization decouples context from dynamics: context frames are tokenized with full detail, while subsequent frames are tokenized conditionally with far fewer tokens, yielding a 16× reduction in token sequence length. This decoupling enhances video quality by providing consistent contextual information and makes training and generation significantly more efficient than with standard VQGAN tokenizers. The tokenizer is trained with a combination of an L1 reconstruction loss, a commitment loss, a perceptual loss, and, optionally, an adversarial loss; a sketch of this objective is given below.
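The following is a minimal PyTorch-style sketch of how such a four-term tokenizer objective could be assembled. The weighting coefficients, the `perceptual_net` feature extractor, and the discriminator interface are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def tokenizer_loss(x, x_recon, z_e, z_q, perceptual_net, disc=None,
                   beta=0.25, w_percep=1.0, w_adv=0.1):
    """Hypothetical VQGAN-style tokenizer objective combining the four
    terms described above; weights and modules are illustrative."""
    # L1 reconstruction between input frames and decoded frames.
    rec = F.l1_loss(x_recon, x)
    # Commitment loss: pull encoder outputs toward their (detached)
    # nearest codebook entries, as in straight-through VQ training.
    commit = beta * F.mse_loss(z_e, z_q.detach())
    # Perceptual loss measured in a fixed feature space (e.g. VGG/LPIPS).
    percep = w_percep * F.mse_loss(perceptual_net(x_recon), perceptual_net(x))
    loss = rec + commit + percep
    # Optional adversarial term: push the discriminator to rate
    # reconstructions as real (standard generator BCE loss).
    if disc is not None:
        logits_fake = disc(x_recon)
        loss = loss + w_adv * F.binary_cross_entropy_with_logits(
            logits_fake, torch.ones_like(logits_fake))
    return loss
```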
Because prediction is framed as plain sequence modeling, conditioning is flexible: action tokens can be interleaved into the sequence for action-conditioned prediction, reward tokens can be appended for reinforcement learning, and goal frames can be included for goal-conditioned video prediction, as demonstrated by a goal-conditioned iVideoGPT pre-trained on massive human and robotic videos. One possible sequence layout is sketched below.
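The sketch below illustrates, under assumed token budgets (256 full-detail tokens per context frame versus 16 compressed tokens per future frame, matching the 16× reduction) and hypothetical special-token ids, how a trajectory might be flattened into one stream for next-token prediction. None of these constants or helper names come from the paper.

```python
from typing import List, Optional

# Illustrative vocabulary layout; the real model's token ids differ.
BOS, SEP, EOS = 0, 1, 2          # hypothetical special tokens
N_CTX_TOKENS = 256               # full tokens per context frame (assumed 16x16 grid)
N_DYN_TOKENS = 16                # compressed tokens per future frame (16x fewer)

def build_sequence(ctx_frames: List[List[int]],
                   future_frames: List[List[int]],
                   actions: Optional[List[List[int]]] = None,
                   rewards: Optional[List[int]] = None) -> List[int]:
    """Flatten an interactive trajectory into one token sequence.

    Context frames carry full detail; future frames use the compressed
    dynamics tokens. Action and reward tokens, if given, are slotted
    between frames so the transformer can condition on and predict them.
    """
    seq = [BOS]
    for f in ctx_frames:
        assert len(f) == N_CTX_TOKENS
        seq += f + [SEP]
    for t, f in enumerate(future_frames):
        assert len(f) == N_DYN_TOKENS
        if actions is not None:          # action-conditioned prediction
            seq += actions[t]
        seq += f
        if rewards is not None:          # reward prediction for MBRL
            seq.append(rewards[t])
        seq.append(SEP)
    return seq + [EOS]
```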
Extensive experiments evaluate iVideoGPT on video prediction, visual planning, and visual model-based reinforcement learning. The model simulates accurate and realistic experiences and performs competitively with state-of-the-art methods, while zero-shot prediction and few-shot adaptation show that it generalizes to new domains and tasks. Ablations support the design choices: the compressive tokenizer provides more consistent contextual information and significantly better computational efficiency than standard VQGAN tokenizers, and the decoder reproduces movement trajectories accurately even with minimal contextual information, validating the context-dynamics decoupling. Unlike recurrent world models, which lack the capacity for large-scale pre-training on real-world data, iVideoGPT's transformer backbone scales readily, and experiments with larger model sizes show improved performance. Together, the framework's ability to handle complex tasks and its flexibility in sequence modeling highlight its potential for real-world applications.
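As a usage illustration, the sketch below shows how an interactive world model of this kind could drive visual planning via random shooting: sample candidate action sequences, imagine rollouts with predicted rewards, and execute the best first action. The `world_model.rollout` API and the planner itself are assumptions for illustration, not the paper's model-based RL training recipe.

```python
import numpy as np

def plan_action(world_model, obs_tokens, action_space, horizon=5, n_candidates=64):
    """Hypothetical random-shooting planner on top of a token-level world
    model: sample action sequences, imagine rollouts, keep the best."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        # Sample a candidate open-loop action sequence.
        actions = [action_space.sample() for _ in range(horizon)]
        # Imagined rollout: the model autoregressively predicts future
        # observation tokens and rewards given the actions (assumed API).
        _, rewards = world_model.rollout(obs_tokens, actions)
        ret = float(np.sum(rewards))
        if ret > best_return:
            best_return, best_first_action = ret, actions[0]
    return best_first_action
```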