OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation


13 Jun 2024 | Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, Yu-Gang Jiang
OmniTokenizer is a transformer-based tokenizer for joint image and video tokenization, addressing the limitation of existing tokenizers that specialize in either image or video inputs. It employs a spatial-temporal decoupled architecture that combines window attention for spatial modeling with causal attention for temporal modeling.

A progressive training strategy is introduced: OmniTokenizer is first trained on image data at a fixed resolution to develop spatial encoding capabilities, then jointly trained on image and video data at multiple resolutions to learn temporal dynamics. This allows OmniTokenizer to handle both image and video inputs within a unified framework, achieving state-of-the-art results, including 1.11 reconstruction FID on ImageNet and 42 reconstruction FVD on UCF-101, improvements of 13% and 26% over previous methods, respectively.

Extensive experiments further show that OmniTokenizer delivers superior visual synthesis when integrated with language models and diffusion models, and results across multiple datasets demonstrate its versatility and scalability. The architecture consists of patch embedding followed by separate spatial and temporal attention blocks, and the progressive training strategy strengthens visual encoding, yielding significant gains in both reconstruction metrics and generation quality.
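To make the spatial-temporal decoupled design concrete, here is a minimal PyTorch sketch of one such block: window attention over patches within each frame for spatial modeling, followed by causally masked attention across frames for temporal modeling. The class names, tensor layout (B, T, H, W, C), and hyperparameters (window_size=8, dim=512) are illustrative assumptions, not the authors' released implementation; an image is treated as a single-frame video (T=1), so the same block serves both modalities.

```python
# Illustrative sketch of a spatial-temporal decoupled transformer block.
# All names and settings are assumptions for exposition, not the official code.
import torch
import torch.nn as nn


class SpatialWindowAttention(nn.Module):
    """Self-attention restricted to non-overlapping spatial windows within each frame."""

    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, H, W, C) token grid after patch embedding
        B, T, H, W, C = x.shape
        w = self.window_size
        # partition every frame into (H//w * W//w) windows of w*w tokens
        x = x.reshape(B, T, H // w, w, W // w, w, C)
        x = x.permute(0, 1, 2, 4, 3, 5, 6).reshape(-1, w * w, C)
        x, _ = self.attn(x, x, x)
        # reverse the window partition back to the (B, T, H, W, C) grid
        x = x.reshape(B, T, H // w, W // w, w, w, C)
        x = x.permute(0, 1, 2, 4, 3, 5, 6).reshape(B, T, H, W, C)
        return x


class TemporalCausalAttention(nn.Module):
    """Self-attention across frames at each spatial position, with a causal mask."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        B, T, H, W, C = x.shape
        # fold spatial positions into the batch and attend along the time axis
        x = x.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        x, _ = self.attn(x, x, x, attn_mask=mask)
        x = x.reshape(B, H, W, T, C).permute(0, 3, 1, 2, 4)
        return x


class DecoupledBlock(nn.Module):
    """One encoder block: spatial window attention, then temporal causal attention."""

    def __init__(self, dim=512, num_heads=8, window_size=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.spatial = SpatialWindowAttention(dim, num_heads, window_size)
        self.norm2 = nn.LayerNorm(dim)
        self.temporal = TemporalCausalAttention(dim, num_heads)

    def forward(self, x):
        x = x + self.spatial(self.norm1(x))
        x = x + self.temporal(self.norm2(x))
        return x


if __name__ == "__main__":
    # An image corresponds to T=1; a video uses T>1 with the same block.
    tokens = torch.randn(2, 4, 16, 16, 512)   # (B, T, H, W, C)
    print(DecoupledBlock()(tokens).shape)      # torch.Size([2, 4, 16, 16, 512])
```

This mirrors the decoupling described in the summary: spatial attention never mixes information across frames, while temporal attention never mixes information across spatial positions and only looks at past frames, which is what lets image-only pretraining transfer directly to video.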