21 Mar 2024 | Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi
StreamingT2V is an autoregressive method for generating long videos from text, ensuring temporal consistency, rich motion dynamics, and high frame-level image quality. It enables the creation of videos up to 1200 frames (2 minutes) long with smooth transitions, and can be extended to even longer durations. The method introduces two key components: a Conditional Attention Module (CAM) for short-term memory, which conditions the current generation on frames from the previous chunk to ensure consistent chunk transitions, and an Appearance Preservation Module (APM) for long-term memory, which preserves scene and object features across the video generation process. Additionally, a randomized blending approach enhances video quality and ensures seamless transitions between chunks. Experiments show that StreamingT2V generates high-motion videos without stagnation, outperforming existing methods in consistency and motion. The method builds on a pre-trained text-to-video model and is enhanced with a high-resolution video model for autoregressive generation. StreamingT2V demonstrates superior temporal consistency, motion amount, and text alignment compared to other baselines, making it a powerful tool for generating high-quality long videos from text.
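To make the chunk-stitching idea concrete, here is a minimal sketch of randomized blending between two consecutive video chunks. It assumes the chunks share a fixed number of overlapping frames and picks a random cut point inside that overlap, taking frames from the first chunk before the cut and from the second chunk after it; the function name, array shapes, and frame counts are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def randomized_blend(chunk_a: np.ndarray, chunk_b: np.ndarray, overlap: int) -> np.ndarray:
    """Stitch two consecutive chunks that share `overlap` frames.

    chunk_a, chunk_b: arrays of shape (frames, H, W, C), where the last
    `overlap` frames of chunk_a depict the same moment as the first
    `overlap` frames of chunk_b.
    """
    rng = np.random.default_rng()
    # Randomly choose where inside the shared region to switch chunks,
    # so the seam position varies instead of always falling at the boundary.
    cut = int(rng.integers(0, overlap + 1))
    head = chunk_a[: len(chunk_a) - overlap + cut]  # frames up to the cut
    tail = chunk_b[cut:]                            # frames after the cut
    return np.concatenate([head, tail], axis=0)

# Illustrative usage: two 16-frame chunks overlapping by 8 frames
# always yield a 24-frame result, whatever cut point is drawn.
a = np.zeros((16, 8, 8, 3))
b = np.ones((16, 8, 8, 3))
video = randomized_blend(a, b, overlap=8)
```

Because the cut point is resampled per seam, repeated chunk boundaries do not all share the same transition frame, which is what smooths out visible stitching artifacts.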