25 Nov 2023 | Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach
Stable Video Diffusion (SVD) is a latent video diffusion model designed for high-resolution text-to-video and image-to-video generation. The paper addresses the challenge of training video diffusion models by identifying and evaluating three key training stages: text-to-image pretraining, video pretraining, and high-quality video finetuning. It introduces a systematic approach to curating video data, including methods for data processing and annotation, to build a high-quality dataset suitable for generative video modeling. The authors demonstrate that pretraining on well-curated data significantly improves performance, an advantage that persists even after high-quality finetuning. They also show that the model learns a strong motion representation and can be fine-tuned for downstream tasks such as image-to-video generation and multi-view synthesis. The resulting model outperforms state-of-the-art methods in visual quality and multi-view consistency while using a fraction of their compute budget. The paper concludes with a discussion of the broader impact and limitations of the work.
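As a concrete illustration of the image-to-video use case described above, here is a minimal sketch using the `StableVideoDiffusionPipeline` from Hugging Face's `diffusers` library. Note this reflects the publicly released checkpoints and `diffusers` API rather than anything specified in the paper itself, and the input image URL is a placeholder.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the publicly released SVD image-to-video checkpoint in half precision.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Conditioning frame: any still image, resized to the model's native resolution.
image = load_image("https://example.com/conditioning_frame.png")  # placeholder URL
image = image.resize((1024, 576))

# Generate a short clip conditioned on the image; decode_chunk_size trades
# VRAM for speed when decoding latents back to frames.
generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

# Write the frames out as an mp4.
export_to_video(frames, "generated.mp4", fps=7)
```

The pipeline wraps the paper's latent-diffusion design: the conditioning image is encoded into the latent space, the video diffusion model denoises a latent clip conditioned on it, and the decoder maps latents back to RGB frames.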