2022 | Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, Tim Salimans
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, it generates high-definition videos using a frozen T5 text encoder, a base video diffusion model, and a sequence of interleaved spatial and temporal video super-resolution models, producing output at 1280×768 resolution, 24 frames per second, and 5.3 seconds in duration.

Key design choices for scaling to high-definition text-to-video generation include fully convolutional spatial and temporal super-resolution models and the v-prediction parameterization of diffusion models, which improves numerical stability. The system also transfers findings from previous image generation work to video, and applies progressive distillation with classifier-free guidance for fast, high-quality sampling; the distilled models are significantly faster than the originals, reducing sampling time and computational cost.

Trained on a combination of internal and publicly available datasets, Imagen Video generates high-fidelity, temporally coherent videos aligned with a wide variety of text prompts, with high controllability and world knowledge, including text animations and diverse videos in various artistic styles and with 3D object understanding. It shows improvements on perceptual quality metrics such as CLIP score and FID.
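The v-prediction parameterization mentioned above has the diffusion model predict v = α_t·ε − σ_t·x rather than the noise ε directly. A minimal NumPy sketch of the target and its inversion, assuming a variance-preserving schedule with α_t² + σ_t² = 1 (function names are illustrative, not from the paper's code):

```python
import numpy as np

def v_target(x, eps, alpha_t, sigma_t):
    # v-parameterization target: v = alpha_t * eps - sigma_t * x
    return alpha_t * eps - sigma_t * x

def x_from_v(z_t, v, alpha_t, sigma_t):
    # Given the noised sample z_t = alpha_t * x + sigma_t * eps,
    # the clean-data estimate is x = alpha_t * z_t - sigma_t * v
    # (exact when alpha_t**2 + sigma_t**2 == 1).
    return alpha_t * z_t - sigma_t * v

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                 # clean sample
eps = rng.standard_normal(4)               # Gaussian noise
t = 0.3
alpha_t, sigma_t = np.cos(t), np.sin(t)    # variance-preserving schedule
z_t = alpha_t * x + sigma_t * eps          # noised sample
v = v_target(x, eps, alpha_t, sigma_t)
assert np.allclose(x_from_v(z_t, v, alpha_t, sigma_t), x)
```

Substituting z_t and v confirms the recovery: α_t·z_t − σ_t·v = (α_t² + σ_t²)·x = x, which is one reason v-prediction behaves well numerically across the noise schedule.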
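Classifier-free guidance, which the distilled samplers retain, combines conditional and unconditional model predictions at each sampling step. A minimal sketch under standard assumptions (the guidance weight w and function names are illustrative):

```python
import numpy as np

def cfg_combine(pred_cond, pred_uncond, w):
    # Classifier-free guidance: pred_uncond + w * (pred_cond - pred_uncond).
    # w = 1 recovers the conditional model; w > 1 strengthens text alignment
    # at some cost to sample diversity.
    return pred_uncond + w * (pred_cond - pred_uncond)

pred_c = np.array([1.0, 2.0])   # text-conditional prediction
pred_u = np.array([0.0, 0.0])   # unconditional (null-prompt) prediction
assert np.allclose(cfg_combine(pred_c, pred_u, 1.0), pred_c)        # w=1: conditional
assert np.allclose(cfg_combine(pred_c, pred_u, 2.0), [2.0, 4.0])    # w=2: extrapolated
```

Progressive distillation then trains a student to match, in one step, the result of two guided teacher steps, halving the sampler length at each distillation round.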
However, there are concerns about the potential misuse of generative models, and Imagen Video will not be released publicly until these concerns are addressed.