2024 | Yang Jin¹, Zhicheng Sun¹, Kun Xu², Kun Xu², Liwei Chen², Hao Jiang¹, Quzhe Huang¹, Chengru Song², Yuliang Liu², Di Zhang², Yang Song², Kun Gai², Yadong Mu¹
Video-LaVIT is a unified video-language pre-training method that enables large language models (LLMs) to understand and generate video, image, and text content. It addresses the difficulty of effective large-scale pre-training on videos by decomposing each video into keyframes and temporal motions, both of which are adapted to the LLM through efficient tokenization. Keyframes capture the primary visual semantics, while motion vectors represent how the scene evolves between keyframes over time. This decomposition greatly reduces the number of tokens needed to represent temporal dynamics, making pre-training more efficient while preserving the information required for accurate understanding and generation.

Concretely, the video tokenizer combines a pre-trained image tokenizer with a spatiotemporal motion encoder to convert video data into discrete tokens, and the video detokenizer maps these tokens back to continuous pixel space for video generation (see the sketch below). Because each clip becomes a compact token sequence, long videos can be handled through autoregressive pre-training over successive clips, which helps the model learn the sequential relationships between them.

Trained on large-scale multimodal data, Video-LaVIT achieves competitive performance across 13 multimodal benchmarks, demonstrating its effectiveness in both understanding and generating video content, with strong results in zero-shot video question answering and text-to-video generation. It also excels at long video generation, where explicit noise constraints improve temporal consistency.
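To make the decomposition concrete, here is a minimal Python sketch of the decoupled tokenization idea. Everything in it is a hypothetical stand-in: `ImageTokenizer`, `MotionTokenizer`, and the token budgets are illustrative placeholders, not the paper's actual models, codebooks, or numbers. The sketch only shows how each clip reduces to a short run of keyframe tokens followed by motion tokens, and how successive clips concatenate into one sequence for autoregressive pre-training.

```python
"""Minimal sketch of Video-LaVIT-style decoupled tokenization.

Assumption: `ImageTokenizer` and `MotionTokenizer` are hypothetical stand-ins
for the paper's pre-trained image tokenizer and spatiotemporal motion encoder;
codebook sizes and token counts below are illustrative, not the real ones.
"""
import numpy as np


class ImageTokenizer:
    """Stand-in for a pre-trained image tokenizer: keyframe -> discrete token IDs."""

    def __init__(self, codebook_size: int = 16384, tokens_per_keyframe: int = 256):
        self.codebook_size = codebook_size
        self.tokens_per_keyframe = tokens_per_keyframe

    def encode(self, keyframe: np.ndarray) -> np.ndarray:
        # A real tokenizer would run a learned visual encoder plus vector
        # quantization; random IDs are used here purely as a placeholder.
        return np.random.randint(0, self.codebook_size, size=self.tokens_per_keyframe)


class MotionTokenizer:
    """Stand-in for a spatiotemporal motion encoder: motion-vector field -> token IDs."""

    def __init__(self, codebook_size: int = 1024, tokens_per_clip: int = 64):
        self.codebook_size = codebook_size
        self.tokens_per_clip = tokens_per_clip

    def encode(self, motion_vectors: np.ndarray) -> np.ndarray:
        # A real tokenizer would compress the clip's motion into a few discrete codes.
        return np.random.randint(0, self.codebook_size, size=self.tokens_per_clip)


def tokenize_clip(keyframe, motion_vectors, image_tok, motion_tok):
    """One clip -> [keyframe tokens | motion tokens], the unit the LLM consumes."""
    return np.concatenate([image_tok.encode(keyframe), motion_tok.encode(motion_vectors)])


if __name__ == "__main__":
    image_tok, motion_tok = ImageTokenizer(), MotionTokenizer()

    # Toy video: 3 clips, each with one 256x256 RGB keyframe and a 16-frame,
    # 32x32, 2-channel (dx, dy) motion-vector field.
    clips = [
        (np.zeros((256, 256, 3), dtype=np.uint8),
         np.zeros((16, 32, 32, 2), dtype=np.float32))
        for _ in range(3)
    ]

    # Autoregressive pre-training views the video as the concatenation of its
    # clips' token sequences, so the model learns how consecutive clips relate.
    sequence = np.concatenate(
        [tokenize_clip(kf, mv, image_tok, motion_tok) for kf, mv in clips]
    )

    # 3 clips x (256 keyframe + 64 motion) tokens = 960 tokens, far fewer than
    # tokenizing all 48 frames independently (48 x 256 = 12288 tokens).
    print(sequence.shape)
```

The point of the sketch is the token arithmetic, not the models: representing a clip as one tokenized keyframe plus a small motion code is what keeps long videos within an LLM's context budget.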
Evaluated across both image and video understanding benchmarks, the model delivers strong results in both settings, making Video-LaVIT a promising approach to multimodal pre-training and a unified framework for video, image, and text understanding and generation.