VideoTetris: Towards Compositional Text-to-Video Generation


2024 | Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, Di Zhang, Bin Cui
VideoTetris is a novel framework for compositional text-to-video generation, addressing the limitations of existing methods in handling complex, long video generation scenarios involving multiple objects and dynamic changes. The framework introduces Spatio-Temporal Compositional Diffusion, which manipulates and composes the attention maps of denoising networks spatially and temporally so that generated videos precisely follow complex textual semantics. It also proposes an enhanced video data preprocessing pipeline to improve motion dynamics and prompt understanding, along with a Reference Frame Attention mechanism to maintain consistency in auto-regressive video generation. Extensive experiments show that VideoTetris achieves strong qualitative and quantitative results in compositional text-to-video generation.

The framework targets two primary tasks: (i) Video Generation with Compositional Prompts, which integrates objects with various attributes and relationships into a single coherent video; and (ii) Long Video Generation with Progressive Compositional Prompts, where "progressive" refers to continuous changes in the position, quantity, and presence of objects with different attributes and relationships. The Spatio-Temporal Compositional Diffusion method manipulates the cross-attention values of the denoising networks temporally and spatially, synthesizing videos that faithfully follow complex or progressive instructions. The enhanced video data preprocessing pipeline augments the training data with improved motion dynamics and prompt semantics, enabling the model to perform more effectively in long video generation with progressive compositional prompts. A consistency regularization method, Reference Frame Attention, maintains content consistency in a representation space coherent with the latent noise while accepting arbitrary image inputs, ensuring the consistency of multiple objects across different frames and positions.
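To make the spatio-temporal composition idea concrete, the following is a minimal sketch (not the paper's actual implementation) of how per-sub-prompt denoiser outputs could be blended with spatial masks that also vary over the temporal axis, so regions can move, appear, or disappear across frames. The function and array names are illustrative assumptions.

```python
import numpy as np

def compose_predictions(preds, masks):
    """Blend per-sub-prompt denoiser outputs with spatio-temporal masks.

    preds: list of arrays, each (frames, height, width) -- hypothetical
           noise predictions, one per sub-prompt region.
    masks: list of arrays of the same shape -- soft region weights per
           frame; the frame dimension lets regions change over time.
    Returns the mask-weighted composition, normalized per pixel.
    """
    num = sum(p * m for p, m in zip(preds, masks))
    den = sum(masks) + 1e-8  # avoid division by zero where no mask covers
    return num / den

# Toy example: two sub-prompts occupying the left and right halves
# of every frame (stand-ins for region-conditioned denoiser outputs).
F, H, W = 4, 8, 8
pred_a = np.full((F, H, W), 1.0)
pred_b = np.full((F, H, W), 3.0)
mask_a = np.zeros((F, H, W)); mask_a[:, :, : W // 2] = 1.0  # left half
mask_b = np.zeros((F, H, W)); mask_b[:, :, W // 2:] = 1.0   # right half

out = compose_predictions([pred_a, pred_b], [mask_a, mask_b])
```

In a real diffusion loop this composition would run at each denoising step on the latent-space predictions; the paper composes attention maps rather than raw outputs, but the region-weighted blending principle is the same.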
VideoTetris demonstrates superior performance in both short and long video generation, accurately composing objects with their own attributes while maintaining their respective positions. It excels at generating long videos with progressive compositional prompts, seamlessly integrating new characters into the scene while keeping positional and quantity information consistent and accurate.

The framework's contributions are fourfold: (1) a Spatio-Temporal Compositional Diffusion method for handling scenes with multiple objects and following progressive complex prompts; (2) an Enhanced Video Data Preprocessing pipeline that improves auto-regressive long video generation through better motion dynamics and prompt semantics; (3) a consistency regularization method, Reference Frame Attention, that maintains content coherence in compositional video generation; and (4) extensive experiments showing that VideoTetris generates state-of-the-art compositional videos and produces high-quality long videos that align with progressive compositional prompts while maintaining the best consistency.
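The Reference Frame Attention contribution can be sketched as a standard cross-attention in which queries come from the frame currently being denoised and keys/values come from reference-frame features. This is an illustrative stand-in under assumed token shapes, not the paper's exact layer design.

```python
import numpy as np

def reference_frame_attention(current, reference):
    """Cross-attention from current-frame tokens to reference-frame tokens.

    current:   (n_cur, d) latent tokens of the frame being denoised.
    reference: (n_ref, d) features of a reference frame (e.g. from an
               earlier clip in auto-regressive generation) -- a stand-in
               for the paper's shared representation space.
    Returns (n_cur, d): each current token re-expressed as a weighted
    mix of reference content, pulling appearance toward the reference.
    """
    d = current.shape[-1]
    scores = current @ reference.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over reference tokens
    return weights @ reference

# Degenerate check: if every reference token is identical, each output
# token must equal that token, regardless of the attention weights.
rng = np.random.default_rng(0)
cur = rng.standard_normal((6, 4))
ref = np.tile(np.array([1.0, 2.0, 3.0, 4.0]), (5, 1))
attended = reference_frame_attention(cur, ref)
```

In practice such a layer would sit inside the denoising U-Net and accept arbitrary image inputs as the reference, which is what lets the mechanism keep multiple objects consistent across frames and clip boundaries.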