28 Dec 2023 | Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis
The paper introduces Video Latent Diffusion Models (Video LDMs) for high-resolution video synthesis, addressing the challenge of efficiently generating long, temporally coherent videos. The authors build on pre-trained image diffusion models (LDMs) and turn them into video generators by inserting temporal layers that align individual frames into temporally consistent sequences; these temporal layers are trained on video data, and the models are further fine-tuned to improve temporal consistency and spatial resolution (a minimal sketch of the temporal-layer idea appears after the list below). Key contributions include:
1. **Efficient Video Generation**: By reusing image-pre-trained spatial layers and training only the newly inserted temporal layers on video data, the method reduces computational demands while maintaining high-quality video synthesis.
2. **High-Resolution Video Synthesis**: The Video LDMs achieve state-of-the-art performance on real driving-scene videos at 512 × 1024 resolution and generate coherent videos up to 5 minutes long.
3. **Text-to-Video Generation**: The method transforms the publicly available Stable Diffusion text-to-image LDM into a powerful text-to-video generator with resolutions up to 1280 × 2048.
4. **Personalized Text-to-Video**: The learned temporal layers can be combined with different image model checkpoints, enabling personalized text-to-video generation for the first time.
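The temporal-alignment idea can be illustrated with a short PyTorch sketch. This is a minimal, illustrative interpretation rather than the authors' implementation: `TemporalAlignmentBlock`, its layer choices, and the sigmoid-blended mixing scalar `alpha` are assumptions standing in for the paper's frozen spatial layers, added temporal layers, and learned per-layer mixing factor.

```python
# Hedged sketch: a frozen spatial layer (standing in for a pre-trained image-LDM
# block) processes each frame independently, a newly added temporal layer attends
# across frames, and a learned scalar blends the two outputs. Names and shapes
# are illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class TemporalAlignmentBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Frozen spatial layer: reused image weights are not updated.
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        for p in self.spatial.parameters():
            p.requires_grad = False
        # New temporal layer: self-attention over the frame axis.
        self.temporal = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Learned mixing factor between spatial-only and temporally aligned features.
        self.alpha = nn.Parameter(torch.tensor(0.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Spatial layer treats frames as independent images.
        z = self.spatial(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # Temporal attention: each spatial location attends across frames.
        seq = z.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        z_temporal, _ = self.temporal(seq, seq, seq)
        z_temporal = z_temporal.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        # Simplified blend: training shifts weight between the frozen spatial
        # branch and the new temporal branch.
        mix = torch.sigmoid(self.alpha)
        return mix * z + (1.0 - mix) * z_temporal


# Example: 2 videos of 8 frames, 64 channels, 16x16 latent resolution.
block = TemporalAlignmentBlock(channels=64)
out = block(torch.randn(2, 8, 64, 16, 16))
print(out.shape)  # torch.Size([2, 8, 64, 16, 16])
```

Because only the temporal branch and the mixing scalar carry gradients, the block preserves the pre-trained image model's per-frame behavior while learning how strongly to enforce cross-frame consistency.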
The paper evaluates the models using metrics such as Fréchet Inception Distance (FID), Fréchet Video Distance (FVD), and human evaluation, demonstrating superior performance compared to existing baselines. The authors also discuss the broader impact and limitations of their work, emphasizing the ethical considerations and the need for ethically sourced data in future research.
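Both FID and FVD reduce to the same Fréchet distance between Gaussians fitted to feature statistics (Inception features for FID, features from a video network such as I3D for FVD). The NumPy/SciPy sketch below shows that shared computation; the feature extractors are omitted, and the random arrays are placeholders rather than the paper's evaluation pipeline.

```python
# Hedged sketch of the Frechet distance underlying FID and FVD:
# ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2 (S_r S_f)^{1/2}),
# computed on network activations of real vs. generated samples.
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """feats_*: (num_samples, feature_dim) arrays of extracted activations."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    # Matrix square root of the covariance product; drop tiny imaginary
    # components that arise from numerical error.
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))


# Toy example with placeholder 64-d "features" for 256 samples each.
rng = np.random.default_rng(0)
real = rng.normal(size=(256, 64))
fake = rng.normal(loc=0.1, size=(256, 64))
print(frechet_distance(real, fake))
```

Lower values indicate that the generated feature distribution lies closer to the real one, which is why both metrics are reported alongside human evaluation.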