28 Dec 2023 | Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis
This paper introduces Video Latent Diffusion Models (Video LDMs) for high-resolution video synthesis. The approach extends latent diffusion models (LDMs), which are computationally efficient for image generation, to video by inserting temporal alignment layers into the architecture. The key idea is to pre-train an LDM on images only and then turn it into a video generator by introducing a temporal dimension and fine-tuning the newly added temporal layers on encoded video sequences, while the pre-trained spatial layers stay fixed. The temporal layers learn to align the individual frames, so the model produces temporally consistent videos while retaining the high resolution of the underlying image model.

The full pipeline also includes a temporal interpolation model for high frame rates and a diffusion-based video super-resolution stage that is itself made temporally consistent. Because only the temporal layers are trained, the method is efficient in both computation and memory, and pre-trained image LDMs such as Stable Diffusion can be transformed into high-resolution text-to-video generators. The trained temporal layers can also be combined with differently fine-tuned image backbones, enabling personalized text-to-video generation.

The method is validated on real-world driving data, where it achieves state-of-the-art performance and can generate long, high-resolution driving videos, as well as on text-to-video generation. The paper highlights applications in autonomous driving simulation and creative content generation, and the experiments show that Video LDMs outperform existing methods in video quality and resolution. The authors conclude that Video LDMs provide a powerful and efficient solution for high-resolution video synthesis.
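To make the interleaving idea concrete, here is a minimal PyTorch sketch (not the authors' code): a frozen spatial block from a pre-trained image LDM processes the frames independently, a newly added temporal self-attention layer aligns them across time, and a learned gate mixes the two paths. All class, function, and parameter names (SpatioTemporalBlock, TemporalAttentionBlock, alpha) are illustrative assumptions, not identifiers from the paper or its released code.

```python
import torch
import torch.nn as nn


class TemporalAttentionBlock(nn.Module):
    """Self-attention over the time axis only: each spatial location is
    treated as an independent sequence of T frame features."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B*H*W, T, C) -- one temporal sequence per spatial position
        h = self.norm(x)
        h, _ = self.attn(h, h, h)
        return x + h


class SpatioTemporalBlock(nn.Module):
    """Wraps a frozen spatial block from a pre-trained image LDM and adds a
    trainable temporal layer; a sigmoid-gated scalar mixes the two outputs."""

    def __init__(self, spatial_block: nn.Module, channels: int):
        super().__init__()
        self.spatial = spatial_block  # pre-trained image layers, kept frozen
        for p in self.spatial.parameters():
            p.requires_grad_(False)
        self.temporal = TemporalAttentionBlock(channels)
        # learned mix factor; training decides how much temporal output to use
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (B*T, C, H, W) -- frames stacked into the batch dimension,
        # so the spatial block sees them as an ordinary image batch
        z = self.spatial(x)
        bt, c, h, w = z.shape
        b = bt // num_frames
        # rearrange to (B*H*W, T, C) so attention runs over time only
        zt = (z.view(b, num_frames, c, h, w)
               .permute(0, 3, 4, 1, 2)
               .reshape(b * h * w, num_frames, c))
        zt = self.temporal(zt)
        zt = (zt.view(b, h, w, num_frames, c)
                .permute(0, 3, 4, 1, 2)
                .reshape(bt, c, h, w))
        # learned blend between the image-only path and the temporally aligned path
        gate = torch.sigmoid(self.alpha)
        return gate * z + (1.0 - gate) * zt


if __name__ == "__main__":
    B, T, C, H, W = 2, 8, 64, 16, 16
    block = SpatioTemporalBlock(nn.Conv2d(C, C, 3, padding=1), channels=C)
    frames = torch.randn(B * T, C, H, W)
    out = block(frames, num_frames=T)
    print(out.shape)  # torch.Size([16, 64, 16, 16])
```

Because the spatial parameters are frozen and only the temporal layers (and the gate) receive gradients, the same recipe can in principle be applied on top of different image backbones, which is what enables the paper's transfer to Stable Diffusion and to personalized image models.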