3 Jun 2024 | Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu
**Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization**
This paper addresses the challenge of effective large-scale pre-training of multimodal Large Language Models (LLMs) for video content, which is more complex than image data due to its spatiotemporal dynamics. The proposed Video-LaVIT framework decomposes videos into keyframes and temporal motions, enabling efficient tokenization of visual and temporal information. This decomposition reduces the number of tokens required for pre-training, making it more computationally efficient. The key contributions include:
1. **Efficient Video Representation**: Videos are decomposed into keyframes and motion vectors, which are then tokenized by a novel video tokenizer. This decomposition substantially reduces the number of tokens needed to represent video temporal dynamics (see the sketch after this list).
2. **Unified Generative Pre-training**: The decomposed video representation is used for unified generative pre-training of LLMs, allowing the model to understand and generate both images and videos.
3. **Competitive Performance**: Video-LaVIT achieves competitive performance across 13 multimodal benchmarks, demonstrating its effectiveness in both understanding and generating video content.
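To make the decomposition in contribution 1 concrete, here is a minimal sketch of splitting a video into keyframes and per-frame motion estimates. It is illustrative only: OpenCV dense optical flow stands in for whatever motion representation the paper's tokenizer actually consumes, and the fixed keyframe interval and the `decompose_video` helper are assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

def decompose_video(path, keyframe_interval=30):
    """Split a video into keyframes (visual content) and motion estimates
    (temporal dynamics). Illustrative stand-in for Video-LaVIT's decoupled
    visual-motional representation, not the paper's actual pipeline."""
    cap = cv2.VideoCapture(path)
    keyframes, motions = [], []
    prev_gray, idx = None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if idx % keyframe_interval == 0:
            # Keyframe: sent to the visual (image) tokenizer.
            keyframes.append(frame)
        elif prev_gray is not None:
            # Motion between consecutive frames: sent to the motion tokenizer.
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            motions.append(flow)
        prev_gray, idx = gray, idx + 1
    cap.release()
    return keyframes, motions
```

Because each clip contributes only a handful of keyframes plus compact motion descriptors, the total token count is far lower than tokenizing every frame as an image, which is what makes large-scale pre-training tractable.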
The paper also includes a detailed methodological section, experimental results, and an ablation study to validate the effectiveness of the proposed approach. The code and models are available at <https://video-lavit.github.io>.
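Contribution 2 describes folding the decomposed representation into a single autoregressive sequence alongside text. The snippet below is a purely hypothetical sketch of that idea; the delimiter tokens, ID values, and the `build_sequence` helper are invented for illustration and are not the paper's actual vocabulary or API.

```python
# Hypothetical special-token IDs marking modality boundaries (not from the paper).
BOI, EOI = 32000, 32001   # begin/end of keyframe (image) tokens
BOM, EOM = 32002, 32003   # begin/end of motion tokens

def build_sequence(text_ids, keyframe_ids, motion_ids):
    """Interleave text, keyframe, and motion tokens into one sequence so a
    single autoregressive LLM can model all of them with next-token prediction."""
    seq = list(text_ids)
    for kf, mv in zip(keyframe_ids, motion_ids):
        seq += [BOI] + list(kf) + [EOI]   # discrete visual tokens for a keyframe
        seq += [BOM] + list(mv) + [EOM]   # discrete motion tokens for the clip
    return seq

# Example: a short caption followed by one keyframe and its motion tokens
# (all token IDs below are made up for illustration).
sequence = build_sequence(
    text_ids=[101, 2023, 2003, 1037, 2678],
    keyframe_ids=[[5001, 5002, 5003]],
    motion_ids=[[7001, 7002]],
)
```

Training the LLM on such interleaved sequences is what lets one model both understand video (condition on visual and motion tokens, predict text) and generate it (predict visual and motion tokens from text).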