EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture

5 Jul 2024 | Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Yunkuo Chen, Bo Liu, MengLi Cheng, Xing Shi, Jun Huang
EasyAnimate is a high-performance method for long video generation built on the transformer architecture. It extends the DiT framework, originally designed for 2D image synthesis, to 3D video generation by incorporating a dedicated motion module block, the Hybrid Motion Module, which combines temporal and global attention to ensure coherent frames and smooth motion transitions. In addition, a Slice VAE compresses the temporal axis, making long-video generation practical: EasyAnimate can generate videos of up to 144 frames from images of varying resolutions.

The project provides a comprehensive ecosystem for video production, covering data preprocessing, VAE training, DiT model training, and end-to-end video inference. The system accepts both image and text prompts, enabling image-guided video generation. The architecture comprises a text encoder, a video VAE, and a diffusion transformer (DiT). The Slice VAE applies different decoding strategies to images and videos, keeping memory usage bounded as video length grows. The video diffusion transformer combines the motion module with a U-ViT connection to improve training stability.

Training proceeds in three stages: training the video VAE, adapting the DiT model to the new VAE, and refining the DiT model on high-quality video data. A bucket strategy enables training across different video resolutions. Because EasyAnimate adapts to various frame counts and resolutions at both training and inference time, it can generate images as well as videos.
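The Hybrid Motion Module's combination of temporal and global attention can be sketched as follows. This is a minimal illustration, not the actual EasyAnimate implementation: the function names, the use of plain numpy, and the single-head, projection-free attention are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the sequence axis (axis -2).
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def hybrid_motion_block(x):
    """x: (frames, tokens, dim) latent video sequence.

    Temporal attention: each spatial token attends across frames,
    which encourages smooth motion at that location.
    Global attention: every (frame, token) pair attends to every
    other, which encourages frame-to-frame coherence.
    """
    f, t, d = x.shape
    # Temporal attention: make frames the sequence axis per token.
    xt = x.transpose(1, 0, 2)            # (tokens, frames, dim)
    xt = xt + attention(xt, xt, xt)      # residual temporal attention
    x = xt.transpose(1, 0, 2)            # back to (frames, tokens, dim)
    # Global attention: flatten frames and tokens into one sequence.
    xg = x.reshape(1, f * t, d)
    xg = xg + attention(xg, xg, xg)      # residual global attention
    return xg.reshape(f, t, d)
```

The global pass is quadratic in `frames * tokens`, which is why the temporal pass, linear in spatial tokens, carries most of the motion modeling.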
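The memory benefit of Slice VAE comes from decoding the latent video in temporal chunks rather than all at once, so peak decoder memory depends on the slice length, not the total video length. The sketch below assumes a stand-in per-slice decoder; the real VAE upsamples spatially as well, and the slicing details (overlap, blending) are omitted.

```python
import numpy as np

def decode_slice(latent_slice):
    # Placeholder for the per-slice VAE decoder: here it only
    # doubles the temporal axis to mimic temporal upsampling.
    return np.repeat(latent_slice, 2, axis=0)

def slice_vae_decode(latents, slice_len=4):
    """Decode a latent video of shape (T, H, W, C) in temporal
    slices of at most slice_len frames, then concatenate, so peak
    memory is bounded by the slice size for arbitrarily long videos."""
    outs = []
    for start in range(0, latents.shape[0], slice_len):
        outs.append(decode_slice(latents[start:start + slice_len]))
    return np.concatenate(outs, axis=0)
```

A single image (T = 1) degenerates to one slice, which is consistent with the paper's point that images and videos take different decoding paths.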
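The bucket strategy for mixed-resolution training can be illustrated with a simple grouping step: samples are binned by (frame count, height, width) so every batch has a uniform tensor shape. This is a hypothetical sketch of the idea; the dictionary keys and the `bucket_batches` helper are assumptions, not EasyAnimate's API.

```python
from collections import defaultdict

def bucket_batches(samples, batch_size):
    """Group samples by (frames, height, width) so each batch stacks
    into one tensor; different resolutions and frame counts then
    train side by side without padding or cropping."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[(s["frames"], s["height"], s["width"])].append(s)
    batches = []
    for group in buckets.values():
        for i in range(0, len(group), batch_size):
            batches.append(group[i:i + batch_size])
    return batches
```

Because inference imposes no fixed shape either, the same model trained this way can be sampled at various frame counts and resolutions, including a single frame for image generation.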