VideoTetris: Towards Compositional Text-to-Video Generation


2024 | Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, Di Zhang, Bin Cui
VideoTetris is a novel framework for compositional text-to-video generation, addressing the limitations of existing methods in handling complex, long video generation scenarios involving multiple objects and dynamic changes. The framework introduces Spatio-Temporal Compositional Diffusion, which manipulates and composes the attention maps of denoising networks spatially and temporally so that generated videos precisely follow complex textual semantics. It also proposes an enhanced video data preprocessing pipeline to improve motion dynamics and prompt understanding, along with a Reference Frame Attention mechanism to maintain consistency in auto-regressive video generation. Extensive experiments show that VideoTetris achieves strong qualitative and quantitative results in compositional text-to-video generation.

The framework targets two primary tasks: (i) Video Generation with Compositional Prompts, which integrates objects with various attributes and relationships into a single coherent video; and (ii) Long Video Generation with Progressive Compositional Prompts, where "progressive" refers to continuous changes in the position, quantity, and presence of objects with different attributes and relationships. The Spatio-Temporal Compositional Diffusion method manipulates the cross-attention values of the denoising networks temporally and spatially, synthesizing videos that faithfully follow complex or progressive instructions. The enhanced video data preprocessing pipeline augments the training data with improved motion dynamics and prompt semantics, enabling the model to perform more effectively in long video generation with progressive compositional prompts. A consistency regularization method, Reference Frame Attention, maintains content consistency in a representation space coherent with the latent noise while accepting arbitrary image inputs, ensuring the consistency of multiple objects across different frames and positions.
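To make the spatio-temporal composition idea concrete, the following is a minimal sketch (not the paper's actual implementation) of how per-sub-prompt denoiser outputs could be blended with spatial masks that also vary over the temporal axis, so regions can move, appear, or disappear across frames. The function and array names are illustrative assumptions.

```python
import numpy as np

def compose_predictions(preds, masks):
    """Blend per-sub-prompt denoiser outputs with spatio-temporal masks.

    preds: list of arrays, each (frames, height, width) -- hypothetical
           noise predictions, one per sub-prompt region.
    masks: list of arrays of the same shape -- soft region weights per
           frame; the frame dimension lets regions change over time.
    Returns the mask-weighted composition, normalized per pixel.
    """
    num = sum(p * m for p, m in zip(preds, masks))
    den = sum(masks) + 1e-8  # avoid division by zero where no mask covers
    return num / den

# Toy example: two sub-prompts occupying the left and right halves
# of every frame (stand-ins for region-conditioned denoiser outputs).
F, H, W = 4, 8, 8
pred_a = np.full((F, H, W), 1.0)
pred_b = np.full((F, H, W), 3.0)
mask_a = np.zeros((F, H, W)); mask_a[:, :, : W // 2] = 1.0  # left half
mask_b = np.zeros((F, H, W)); mask_b[:, :, W // 2:] = 1.0   # right half

out = compose_predictions([pred_a, pred_b], [mask_a, mask_b])
```

In a real diffusion loop this composition would run at each denoising step on the latent-space predictions; the paper composes attention maps rather than raw outputs, but the region-weighted blending principle is the same.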
VideoTetris demonstrates superior performance in both short and long video generation, accurately composing objects with their own attributes while maintaining their respective positions. It excels at generating long videos with progressive compositional prompts, seamlessly integrating new characters into the scene while keeping positional and quantity information consistent and accurate.

The framework's contributions are fourfold: (1) a Spatio-Temporal Compositional Diffusion method for handling scenes with multiple objects and following progressive complex prompts; (2) an Enhanced Video Data Preprocessing pipeline that improves auto-regressive long video generation through better motion dynamics and prompt semantics; (3) a consistency regularization method, Reference Frame Attention, that maintains content coherence in compositional video generation; and (4) extensive experiments showing that VideoTetris generates state-of-the-art compositional videos and produces high-quality long videos that align with progressive compositional prompts while maintaining the best consistency.
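The Reference Frame Attention contribution can be sketched as a standard cross-attention in which queries come from the frame currently being denoised and keys/values come from reference-frame features. This is an illustrative stand-in under assumed token shapes, not the paper's exact layer design.

```python
import numpy as np

def reference_frame_attention(current, reference):
    """Cross-attention from current-frame tokens to reference-frame tokens.

    current:   (n_cur, d) latent tokens of the frame being denoised.
    reference: (n_ref, d) features of a reference frame (e.g. from an
               earlier clip in auto-regressive generation) -- a stand-in
               for the paper's shared representation space.
    Returns (n_cur, d): each current token re-expressed as a weighted
    mix of reference content, pulling appearance toward the reference.
    """
    d = current.shape[-1]
    scores = current @ reference.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over reference tokens
    return weights @ reference

# Degenerate check: if every reference token is identical, each output
# token must equal that token, regardless of the attention weights.
rng = np.random.default_rng(0)
cur = rng.standard_normal((6, 4))
ref = np.tile(np.array([1.0, 2.0, 3.0, 4.0]), (5, 1))
attended = reference_frame_attention(cur, ref)
```

In practice such a layer would sit inside the denoising U-Net and accept arbitrary image inputs as the reference, which is what lets the mechanism keep multiple objects consistent across frames and clip boundaries.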