OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

13 Jun 2024 | Junke Wang, Yi Jiang, Zehuan Yuan, Bingyue Peng, Zuxuan Wu, Yu-Gang Jiang
OmniTokenizer is a transformer-based tokenizer designed for joint image and video tokenization, addressing the limitations of existing tokenizers tailored for either images or videos. It employs a spatial-temporal decoupled architecture, integrating window attention for spatial modeling and causal attention for temporal modeling. A progressive training strategy is proposed, starting with image pre-training at a fixed resolution to develop spatial encoding capabilities, followed by joint training on image and video data at multiple resolutions to learn temporal dynamics. Extensive experiments on various datasets, including ImageNet and UCF-101, demonstrate that OmniTokenizer achieves state-of-the-art reconstruction performance, outperforming previous methods by significant margins. Additionally, when integrated with language model-based and diffusion models, OmniTokenizer enhances their visual synthesis performance, showcasing its versatility and superiority. The paper also includes an ablation study to validate the effectiveness of the proposed methods and discusses future directions for scaling model capacity.
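To make the spatial-temporal decoupled design concrete, the following is a minimal NumPy sketch of the two attention patterns the abstract describes: window attention restricted to non-overlapping patches within each frame, and causal attention along the time axis at each spatial position. This is an illustrative simplification (single head, no learned projections, no positional encodings), not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; masked positions are suppressed.
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def spatial_window_attention(tokens, window=4):
    # tokens: (T, H, W, D). Each frame is split into window x window
    # patches, and attention is computed only within each patch.
    T, H, W, D = tokens.shape
    out = np.empty_like(tokens)
    for t in range(T):
        for i in range(0, H, window):
            for j in range(0, W, window):
                win = tokens[t, i:i + window, j:j + window]
                h, w = win.shape[:2]
                flat = win.reshape(-1, D)
                out[t, i:i + window, j:j + window] = attention(
                    flat, flat, flat).reshape(h, w, D)
    return out

def temporal_causal_attention(tokens):
    # tokens: (T, H, W, D). Each spatial position attends only to the
    # same position in current and past frames (lower-triangular mask).
    T, H, W, D = tokens.shape
    seq = tokens.reshape(T, H * W, D).transpose(1, 0, 2)  # (HW, T, D)
    mask = np.tril(np.ones((T, T), dtype=bool))
    out = attention(seq, seq, seq, mask=mask)
    return out.transpose(1, 0, 2).reshape(T, H, W, D)
```

In a full transformer block these two attention stages would be stacked with residual connections and MLPs; the key property illustrated here is that the causal mask keeps early frames independent of later ones, which is what allows the same architecture to tokenize a single image (T = 1) and a video uniformly.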