5 Mar 2024 | Litong Gong*, Yiran Zhu*, Weijie Li*, Xiaoyang Kang*, Biao Wang, Tiezheng Ge, Bo Zheng
AtomVideo is a high-fidelity image-to-video (I2V) generation framework that combines multi-granularity image injection with advanced training strategies. Given an input image, it generates videos that preserve fidelity and consistency with that image while achieving strong motion intensity. Combined with advanced text-to-image (T2I) models, it also supports text-to-video (T2V) generation. The framework is flexible enough to be adapted to video frame prediction, enabling long-sequence generation through iterative extension, and it integrates with existing personalized T2I models and controllable modules for more customized and controllable video generation.
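The iterative long-sequence generation mentioned above can be sketched as follows. Here `generate_chunk` is a hypothetical stand-in for the I2V sampler, and the names, shapes, and overlap strategy are illustrative assumptions rather than AtomVideo's actual interface.

```python
# Minimal sketch of long-video generation by iterative frame prediction.
import numpy as np

def generate_chunk(cond_frames: np.ndarray, chunk_len: int) -> np.ndarray:
    """Placeholder for the diffusion sampler: given conditioning frames,
    return the next `chunk_len` frames (dummy zeros here)."""
    h, w, c = cond_frames.shape[1:]
    return np.zeros((chunk_len, h, w, c), dtype=cond_frames.dtype)

def generate_long_video(first_frame: np.ndarray,
                        total_frames: int = 96,
                        chunk_len: int = 24,
                        overlap: int = 1) -> np.ndarray:
    """Iteratively extend a video: each new chunk is conditioned on the
    last `overlap` frames generated so far."""
    video = first_frame[None]                    # (1, H, W, C)
    while video.shape[0] < total_frames:
        cond = video[-overlap:]                  # condition on the most recent frames
        chunk = generate_chunk(cond, chunk_len)
        video = np.concatenate([video, chunk], axis=0)
    return video[:total_frames]

frames = generate_long_video(np.zeros((256, 256, 3), dtype=np.float32))
print(frames.shape)                              # (96, 256, 256, 3)
```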
The framework builds on a pre-trained T2I model with added temporal layers and extra input channels, injecting image information through both additional channels and cross-attention. Training uses a zero terminal signal-to-noise ratio (SNR) noise schedule and v-prediction, which improve generation stability without relying on a noisy prior. The model is trained on an internal dataset of 15M videos, with an input size of 512x512 and 24 frames. At inference, classifier-free guidance is applied with both image and text conditioning, further enhancing output stability.
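A minimal sketch of these two injection paths and the v-prediction target is given below, assuming Stable Diffusion-style 4-channel latents; the layer names, dimensions, and exact placement of the cross-attention are assumptions for illustration, not the actual AtomVideo architecture.

```python
import torch
import torch.nn as nn

class ImageInjectedBlock(nn.Module):
    """Sketch of one block with two image-injection paths:
    (1) low-level: the image latent and a frame mask are concatenated with the
        noisy latents along the channel dimension (the extra input channels);
    (2) high-level: semantic image embeddings are injected via cross-attention."""
    def __init__(self, latent_ch: int = 4, feat_dim: int = 320, img_emb_dim: int = 768):
        super().__init__()
        # noisy latent + image latent + 1-channel mask -> extra input channels
        self.in_conv = nn.Conv2d(latent_ch * 2 + 1, feat_dim, kernel_size=3, padding=1)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8,
                                                kdim=img_emb_dim, vdim=img_emb_dim,
                                                batch_first=True)

    def forward(self, noisy_latent, image_latent, mask, img_emb):
        # noisy_latent, image_latent: (B, 4, H, W); mask: (B, 1, H, W); img_emb: (B, L, 768)
        x = torch.cat([noisy_latent, image_latent, mask], dim=1)  # channel concatenation
        h = self.in_conv(x)                                       # (B, C, H, W)
        tokens = h.flatten(2).transpose(1, 2)                     # (B, H*W, C)
        attn, _ = self.cross_attn(tokens, img_emb, img_emb)       # inject semantic image info
        return (tokens + attn).transpose(1, 2).reshape_as(h)

def v_target(x0, noise, alpha_bar_t):
    """v-prediction target: v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0.
    With a zero terminal SNR schedule, alpha_bar at the final timestep is 0,
    so the last training step sees pure noise, matching inference.
    alpha_bar_t should be broadcastable to x0, e.g. shape (B, 1, 1, 1)."""
    return alpha_bar_t.sqrt() * noise - (1 - alpha_bar_t).sqrt() * x0

block = ImageInjectedBlock()
out = block(torch.randn(2, 4, 32, 32), torch.randn(2, 4, 32, 32),
            torch.ones(2, 1, 32, 32), torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 320, 32, 32])
```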
Quantitative evaluations show that AtomVideo outperforms other methods in image consistency, temporal consistency, and motion intensity, and it surpasses commercial systems such as Pika and Gen-2, particularly in motion intensity. Qualitative samples likewise show more coherent and stable videos with greater motion intensity. The model also generalizes well across different resolutions and can be combined with personalized models such as epiCRealism, which excels at light and shadow generation. Because the framework emphasizes fidelity to the given image, it is less compatible with strongly stylized models, such as cartoon-style models.
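The paper's exact metric definitions are not reproduced here, but image consistency, temporal consistency, and motion intensity are commonly computed along the following lines. This hedged sketch operates on precomputed frame embeddings (e.g., from a CLIP image encoder) and precomputed optical flow, rather than calling any specific encoder or flow estimator.

```python
import torch
import torch.nn.functional as F

def image_consistency(frame_embs: torch.Tensor, ref_emb: torch.Tensor) -> float:
    """Mean cosine similarity between each frame embedding and the
    reference-image embedding. frame_embs: (T, D), ref_emb: (D,)."""
    return F.cosine_similarity(frame_embs, ref_emb.unsqueeze(0), dim=-1).mean().item()

def temporal_consistency(frame_embs: torch.Tensor) -> float:
    """Mean cosine similarity between embeddings of consecutive frames."""
    return F.cosine_similarity(frame_embs[:-1], frame_embs[1:], dim=-1).mean().item()

def motion_intensity(flows: torch.Tensor) -> float:
    """Mean optical-flow magnitude over all pixels and frame pairs.
    flows: (T-1, 2, H, W) from any flow estimator."""
    return flows.norm(dim=1).mean().item()

# toy usage with random tensors
embs = F.normalize(torch.randn(24, 512), dim=-1)
print(image_consistency(embs, embs[0]), temporal_consistency(embs))
print(motion_intensity(torch.randn(23, 2, 64, 64)))
```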
In conclusion, AtomVideo is a high-fidelity image-to-video generation framework that achieves excellent performance in maintaining temporal consistency and stability, especially in generating videos with high motion intensity. Future work aims to enhance controllability and expand to more powerful T2I base models.