6 May 2024 | Andrew Melnik, Michal Ljubljana, Cong Lu, Qi Yan, Weiming Ren, Helge Ritter
Diffusion generative models have emerged as a robust technique for producing and modifying coherent, high-quality videos. This survey provides a systematic overview of critical elements of video diffusion models, including applications, architectural choices, and temporal dynamics modeling. Recent advancements are summarized and grouped into development trends, and the survey concludes with an overview of remaining challenges and future directions.
The survey categorizes video diffusion model applications into text-conditioned generation, image-conditioned video generation, video completion, audio-conditioned models, video editing, and intelligent decision-making. It reviews the mathematical formulation of diffusion generative models, focusing on the denoising process and the training objective. Popular architectures such as UNets and transformers are discussed, along with variations like latent diffusion models and cascaded diffusion models.
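To make the denoising objective concrete, the following is a minimal sketch (not taken from the survey) of the standard DDPM-style training loss: a clean sample is noised at a random timestep, and the model is trained to predict the injected noise. The function names and the NumPy setting are illustrative assumptions; real video models would operate on 5D tensors in a deep-learning framework.

```python
import numpy as np

def ddpm_loss(model, x0, alphas_cumprod, rng):
    """Sketch of the simplified DDPM denoising objective: mix Gaussian noise
    into a clean sample x0 at a random timestep t, then train the model to
    predict that noise (mean squared error). `model` is any callable
    model(x_t, t) -> noise prediction; this is an illustrative assumption."""
    t = rng.integers(0, len(alphas_cumprod))
    noise = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    # Forward (noising) process: x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*noise
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    # Training objective: MSE between predicted and true noise
    pred = model(x_t, t)
    return float(np.mean((pred - noise) ** 2))
```

A trivial "model" that always predicts zero noise yields a loss close to the variance of the injected noise, which illustrates what the objective measures.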
Key challenges in video diffusion models include maintaining temporal consistency, generating long videos, and managing computational cost. The survey explores spatio-temporal attention mechanisms, temporal upsampling techniques, and methods for preserving structural integrity during video editing.
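A common way to keep attention tractable over video, discussed in this literature, is to factorize it: attend over spatial positions within each frame, then over time at each spatial position. The sketch below illustrates that factorization on a single-head, unprojected (q = k = v) toy case; the shapes and function names are assumptions for illustration, not the survey's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last (token) axis
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def factorized_st_attention(x):
    """Sketch of factorized spatio-temporal self-attention.
    x has shape (T frames, S spatial tokens, C channels); q = k = v = x
    for brevity (no learned projections in this toy example)."""
    x = attention(x, x, x)         # spatial attention within each frame
    xt = np.swapaxes(x, 0, 1)      # reshape to (S, T, C): tokens axis is time
    xt = attention(xt, xt, xt)     # temporal attention at each spatial position
    return np.swapaxes(xt, 0, 1)   # back to (T, S, C)
```

Compared with full attention over all T*S tokens, this factorization reduces the attention cost from O((T*S)^2) to O(T*S^2 + S*T^2), which is why it is a popular design for video models.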
Training and evaluation methods are also covered, including the use of video and image datasets, and evaluation metrics such as Fréchet Inception Distance (FID), Fréchet Video Distance (FVD), Kernel Video Distance (KVD), and Inception Score (IS). The survey provides an overview of benchmark datasets and evaluates the performance of various video generation models, highlighting the strengths and limitations of different approaches.
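As a concrete illustration of one of these metrics, the Fréchet distance underlying both FID and FVD compares the mean and covariance of feature distributions from real and generated samples. The sketch below assumes features have already been extracted (by an Inception network for FID, a video network for FVD); the function name is illustrative.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between two sets of feature vectors (rows = samples):
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2*(C_a C_b)^(1/2)).
    FID and FVD apply this formula to image and video features, respectively."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from sqrtm
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a + cov_b - 2.0 * covmean))
```

The distance is zero when the two feature distributions have identical mean and covariance, and grows as the generated distribution drifts from the real one.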
Overall, the survey aims to provide a comprehensive understanding of video diffusion models, their applications, and the ongoing research efforts to address the challenges in this field.