Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

25 Nov 2023 | Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach
Stable Video Diffusion is a latent video diffusion model for high-resolution text-to-video and image-to-video generation. The paper presents a systematic approach to curating video data for training, built around three stages: text-to-image pretraining, video pretraining on a large dataset at low resolution, and high-resolution video finetuning on a smaller, high-quality dataset. The authors demonstrate that a well-curated pretraining dataset significantly improves video generation quality and introduce a method for curating video data systematically.

They further show that the base model learns a powerful motion representation for downstream tasks such as image-to-video generation and can be adapted with camera-motion-specific LoRA modules. The model also provides a strong multi-view 3D prior and serves as a base for finetuning a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. Training uses a large dataset of 580 million annotated video clip pairs. The paper reports results on text-to-video, image-to-video, and frame interpolation, showing that the model outperforms existing methods, and concludes that the approach yields a strong 3D prior and state-of-the-art results in multi-view synthesis.
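To make the staged training recipe concrete, below is a minimal, hypothetical Python sketch of the curation-then-pretraining-then-finetuning flow described above. All names (VideoClip, curate_clips, train_pipeline, the motion and quality scores, and the thresholds) are illustrative assumptions, not the authors' actual code or thresholds; the real system operates on a latent diffusion model rather than the stand-in dictionary used here.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VideoClip:
    caption: str
    num_frames: int
    motion_score: float   # assumed per-clip motion estimate
    quality_score: float  # assumed per-clip quality estimate

def curate_clips(clips: List[VideoClip],
                 min_motion: float = 0.1,
                 min_quality: float = 0.5) -> List[VideoClip]:
    """Systematic curation: drop static or low-quality clips before video pretraining."""
    return [c for c in clips
            if c.motion_score >= min_motion and c.quality_score >= min_quality]

def train_pipeline(raw_clips: List[VideoClip], hq_clips: List[VideoClip]) -> dict:
    """Three stages: image pretraining -> low-res video pretraining -> high-res finetuning."""
    model = {"stages": []}                              # stand-in for the latent diffusion model
    model["stages"].append("text-to-image pretraining") # Stage 1: spatial prior from images
    curated = curate_clips(raw_clips)                   # curation the paper shows is critical
    model["stages"].append(
        f"low-res video pretraining on {len(curated)} curated clips")   # Stage 2
    model["stages"].append(
        f"high-res finetuning on {len(hq_clips)} high-quality clips")   # Stage 3
    return model

if __name__ == "__main__":
    clips = [VideoClip("a cat runs across a lawn", 14, 0.8, 0.9),
             VideoClip("static logo on white background", 14, 0.0, 0.7)]
    print(train_pipeline(raw_clips=clips, hq_clips=clips[:1]))
```

The point of the sketch is the ordering: filtering happens before the large-scale video pretraining stage, and only a smaller, higher-quality subset is used for the final high-resolution finetuning.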