31 Oct 2024 | Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, Mingsheng Long
The paper introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework designed to integrate multimodal signals—visual observations, actions, and rewards—into a sequence of tokens, enabling interactive agent experiences through next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations, reducing the token sequence length by up to 16 times. This approach not only facilitates more efficient training and generation but also enhances video quality by decoupling context from dynamics.
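The following is an illustrative sketch (not the authors' released code) of how such multimodal signals might be flattened into a single token stream for next-token prediction. The token counts, id offsets, and special tokens below are assumptions chosen for illustration; the paper's compressive tokenizer is what makes the per-frame token budget this small.

```python
# Hypothetical sketch: interleaving compressed observation tokens, actions,
# and rewards into one sequence (o_0, a_0, r_0, o_1, ...) for a causal transformer.
from typing import List

TOKENS_PER_FRAME_COMPRESSED = 16  # ~16x fewer tokens than a standard per-frame VQ grid (assumption)
ACTION_TOKEN_OFFSET = 10_000      # hypothetical id range reserved for discretized actions
REWARD_TOKEN_OFFSET = 20_000      # hypothetical id range reserved for discretized rewards
BOS, EOF = 1, 2                   # hypothetical begin-of-sequence / end-of-frame markers


def build_sequence(frame_tokens: List[List[int]],
                   action_ids: List[int],
                   reward_bins: List[int]) -> List[int]:
    """Flatten a trajectory into a single token list for next-token prediction."""
    seq = [BOS]
    for t, frame in enumerate(frame_tokens):
        assert len(frame) == TOKENS_PER_FRAME_COMPRESSED
        seq.extend(frame)                                         # compressed visual tokens for o_t
        seq.append(EOF)                                           # frame boundary marker
        if t < len(action_ids):
            seq.append(ACTION_TOKEN_OFFSET + action_ids[t])       # action a_t
        if t < len(reward_bins):
            seq.append(REWARD_TOKEN_OFFSET + reward_bins[t])      # reward r_t
    return seq


# A transformer trained with a causal language-modeling loss on such sequences
# can then be queried interactively: append the agent's action token and let
# the model generate the next frame's tokens (and reward) autoregressively.
```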
iVideoGPT is pre-trained on millions of human and robotic manipulation trajectories, establishing a versatile foundation adaptable to various downstream tasks such as action-conditioned video prediction, visual planning, and model-based reinforcement learning. The model demonstrates competitive performance compared to state-of-the-art methods in these tasks. The paper also explores the flexibility of sequence modeling, showcasing goal-conditioned video prediction and the model's ability to adapt to unseen domains with minimal fine-tuning.
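To make the model-based RL use case concrete, here is a minimal sketch of an "imagination" rollout driven by such an interactive world model. The `DummyWorldModel` and `RandomPolicy` classes and the `step`/`act` interface are hypothetical placeholders, not the released API; in practice `step` would run autoregressive token generation with the pre-trained transformer.

```python
# Minimal sketch (assumed interface): rolling out a trajectory entirely inside
# the learned world model, with no real environment interaction.
import random
from typing import List, Tuple


class DummyWorldModel:
    """Placeholder for the learned model; step() stands in for autoregressive generation."""
    def step(self, obs_tokens: List[int], action: int) -> Tuple[List[int], float]:
        next_obs = [random.randrange(1024) for _ in obs_tokens]  # predicted frame tokens
        reward = random.random()                                 # predicted reward
        return next_obs, reward


class RandomPolicy:
    """Placeholder agent that picks one of a few discrete actions."""
    def act(self, obs_tokens: List[int]) -> int:
        return random.randrange(4)


def imagine_rollout(world_model, policy, init_obs_tokens, horizon=15):
    """Generate imagined experience for planning or policy improvement."""
    trajectory = []
    obs_tokens = init_obs_tokens
    for _ in range(horizon):
        action = policy.act(obs_tokens)
        next_obs_tokens, reward = world_model.step(obs_tokens, action)
        trajectory.append((obs_tokens, action, reward))
        obs_tokens = next_obs_tokens
    return trajectory


if __name__ == "__main__":
    rollout = imagine_rollout(DummyWorldModel(), RandomPolicy(), [0] * 16)
    print(f"Imagined {len(rollout)} steps, total reward {sum(r for _, _, r in rollout):.2f}")
```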
The main contributions of the work include the introduction of iVideoGPT, its pre-training on large-scale datasets, and its successful application in a range of practical tasks. The paper discusses the limitations and future directions, highlighting the need for more extensive data and the incorporation of additional modalities. The code and pre-trained models are available for further research.