IMAGEN VIDEO: HIGH DEFINITION VIDEO GENERATION WITH DIFFUSION MODELS

2022 | Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, Tim Salimans
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models. It generates high-definition videos by combining a base video generation model with a sequence of interleaved spatial and temporal video super-resolution models. The cascade scales to high-definition output, reaching 1280×768 resolution at 24 frames per second. Key contributions include demonstrating that cascaded diffusion models are effective for high-definition video generation, transferring findings from the text-to-image setting to video generation, and applying progressive distillation for fast, high-quality sampling.
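To make the idea of interleaved spatial and temporal super-resolution concrete, the following minimal sketch tracks only the shape of a video as it passes through a cascade. No model is run. The base shape, the stage order, and the upsampling factors are illustrative assumptions chosen so that the final shape matches the 1280×768, 24 frames per second figure quoted above; the names `VideoShape`, `temporal_sr`, `spatial_sr`, and `run_cascade` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoShape:
    frames: int
    width: int
    height: int
    fps: int


def temporal_sr(shape: VideoShape, factor: int) -> VideoShape:
    """Temporal super-resolution: more frames at a higher frame rate."""
    return VideoShape(shape.frames * factor, shape.width, shape.height,
                      shape.fps * factor)


def spatial_sr(shape: VideoShape, factor: int) -> VideoShape:
    """Spatial super-resolution: larger frames, same frame count and rate."""
    return VideoShape(shape.frames, shape.width * factor,
                      shape.height * factor, shape.fps)


# Assumed low-resolution, low-frame-rate output of the base model.
BASE = VideoShape(frames=16, width=40, height=24, fps=3)

# Assumed interleaving of temporal and spatial stages (illustrative only).
STAGES = [
    ("temporal", 2),
    ("spatial", 2),
    ("temporal", 2),
    ("temporal", 2),
    ("spatial", 4),
    ("spatial", 4),
]


def run_cascade(base: VideoShape, stages) -> list:
    """Return the video shape after the base model and after each stage."""
    shapes = [base]
    for kind, factor in stages:
        step = temporal_sr if kind == "temporal" else spatial_sr
        shapes.append(step(shapes[-1], factor))
    return shapes


if __name__ == "__main__":
    for shape in run_cascade(BASE, STAGES):
        print(shape)
```

Each stage only has to solve a small upsampling problem conditioned on the previous stage's output, which is the practical appeal of a cascade over a single model that generates the full-resolution video directly.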
Imagen Video demonstrates strong temporal consistency, deep language understanding, and the ability to generate diverse videos with artistic styles and 3D object understanding. The model conditions on a frozen T5 text encoder and uses the v-prediction parameterization to improve sample quality and convergence. Progressive distillation is applied to reduce sampling time while maintaining perceptual quality. Experiments show that Imagen Video can generate high-fidelity videos in a range of artistic styles and with coherent 3D structure, highlighting its potential for creative applications.
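The v-prediction parameterization mentioned above can be illustrated with a short sketch. With a noisy sample z_t = alpha_t * x + sigma_t * eps and a variance-preserving schedule (alpha_t^2 + sigma_t^2 = 1), the network is trained to predict v_t = alpha_t * eps - sigma_t * x instead of the noise eps, and the clean sample is recovered as x = alpha_t * z_t - sigma_t * v_t. The cosine schedule and all function names below are assumptions made for illustration, not details taken from the summary.

```python
import math


def alpha_sigma(t: float):
    """Assumed cosine schedule on t in [0, 1]; satisfies alpha^2 + sigma^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def diffuse(x, eps, t):
    """Forward process: z_t = alpha_t * x + sigma_t * eps."""
    a, s = alpha_sigma(t)
    return [a * xi + s * ei for xi, ei in zip(x, eps)]


def v_target(x, eps, t):
    """Regression target under v-prediction: v_t = alpha_t * eps - sigma_t * x."""
    a, s = alpha_sigma(t)
    return [a * ei - s * xi for xi, ei in zip(x, eps)]


def x_from_v(z, v, t):
    """Recover the clean sample: x = alpha_t * z_t - sigma_t * v_t."""
    a, s = alpha_sigma(t)
    return [a * zi - s * vi for zi, vi in zip(z, v)]


def eps_from_v(z, v, t):
    """Recover the noise: eps = sigma_t * z_t + alpha_t * v_t."""
    a, s = alpha_sigma(t)
    return [s * zi + a * vi for zi, vi in zip(z, v)]


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0]      # toy "clean" sample
    eps = [0.1, 0.3, -0.7]    # toy noise draw
    t = 0.3
    z = diffuse(x, eps, t)
    v = v_target(x, eps, t)   # stands in for a perfect network prediction
    print(x_from_v(z, v, t))
    print(eps_from_v(z, v, t))
```

Here the exact target stands in for the network output, so the round trip recovers x and eps up to floating-point error. In training, a network's prediction of v would be regressed toward `v_target`.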