5 Jul 2024 | Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Yunkuo Chen, Bo Liu, MengLi Cheng, Xing Shi, Jun Huang
This paper introduces EasyAnimate, an advanced method for generating high-quality videos using the transformer architecture. The method leverages the DiT framework, originally designed for 2D image synthesis, to handle the complexities of 3D video generation by incorporating a Hybrid Motion Module. This module combines temporal and global attention mechanisms to ensure coherent frames and seamless motion transitions. Additionally, the Slice VAE technique is introduced to compress the temporal dimension, reducing memory usage and enabling the generation of long videos. The paper provides a comprehensive ecosystem for video production, including data preprocessing, VAE training, DiT model training, and end-to-end video inference. EasyAnimate can generate videos of up to 144 frames from images of varying resolutions.
The contributions of the paper include the development of EasyAnimate, the exploration of temporal information in video generation, and the proposal of Slice VAE for efficient video compression. The paper also discusses related work in video VAEs and video diffusion models, and details the architecture and training process of EasyAnimate.
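The Hybrid Motion Module's two attention passes can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the block shapes, the single-head attention, and the residual wiring are assumptions chosen only to show the idea of attending across frames (temporal) and then across all tokens of all frames (global).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last axis.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def hybrid_motion_block(x):
    """x: latent video tokens of shape (frames, tokens_per_frame, dim).

    Temporal attention: each spatial position attends across frames,
    giving per-position motion coherence.
    Global attention: every token attends to every token of every frame,
    giving scene-level consistency.
    """
    f, n, d = x.shape
    # Temporal pass: make frames the sequence axis per spatial position.
    xt = x.transpose(1, 0, 2)                       # (tokens, frames, dim)
    x = x + attention(xt, xt, xt).transpose(1, 0, 2)
    # Global pass: flatten frames and tokens into one long sequence.
    xg = x.reshape(1, f * n, d)
    x = x + attention(xg, xg, xg).reshape(f, n, d)
    return x

out = hybrid_motion_block(np.random.randn(4, 16, 8))
print(out.shape)  # (4, 16, 8)
```

In a real DiT block these passes would use learned multi-head projections and layer norms; the sketch keeps identity projections so the data flow stays visible.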
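The memory argument behind Slice VAE can be sketched as follows: encode a long clip slice by slice along the temporal axis, so peak activation memory is bounded by one slice rather than the full video. The `toy_temporal_encoder` below (pairwise frame averaging, 2x temporal compression) is a hypothetical stand-in for the learned encoder; the slice length of 8 is likewise an illustrative choice.

```python
import numpy as np

def toy_temporal_encoder(clip):
    # Stand-in for a learned VAE encoder: 2x temporal compression
    # by averaging consecutive frame pairs. clip: (frames, h, w).
    return clip.reshape(-1, 2, *clip.shape[1:]).mean(axis=1)

def slice_encode(video, slice_len=8):
    """Encode a long video slice by slice along the temporal axis.

    Each slice is processed independently, so memory scales with
    slice_len rather than with the total frame count.
    """
    latents = [toy_temporal_encoder(video[i:i + slice_len])
               for i in range(0, video.shape[0], slice_len)]
    return np.concatenate(latents, axis=0)

video = np.random.randn(32, 4, 4)   # 32 frames of 4x4 "pixels"
lat = slice_encode(video, slice_len=8)
print(lat.shape)  # (16, 4, 4): temporal dimension halved
```

A real Slice VAE must also handle continuity at slice boundaries (so adjacent slices decode without seams), which this independent-slice sketch deliberately omits.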