Diffusion Model-Based Video Editing: A Survey

26 Jun 2024 | Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, and Dacheng Tao, Fellow, IEEE
This paper provides a comprehensive survey of diffusion model-based video editing techniques, covering theoretical foundations, practical applications, and recent advancements. It begins with an overview of diffusion models, including their mathematical formulation and image generation methods, followed by an exploration of video generation and motion representation. The paper then categorizes diffusion model-based video editing approaches into five primary classes based on their underlying technologies. A new benchmark, V2VBench, is introduced, encompassing four text-guided video editing tasks, along with a detailed evaluation and analysis. The paper concludes with an overview of open challenges and potential directions for future research.

Diffusion models have emerged as a leading approach for vision generation tasks, including Text-to-Image (T2I) generation and Image-to-Image (I2I) translation. These models can generate high-quality, semantically accurate images from text descriptions or modify input images according to specified conditions. Extending them to video generation and editing has attracted considerable research interest, although adapting designs built for static images to the dynamic, temporal nature of video remains challenging, and the scarcity of high-quality video datasets adds further technical difficulty. Among recent diffusion model-based video generative methods, the two most influential directions are generating videos from text inputs and generatively editing existing videos. Editing existing videos avoids prohibitively expensive video pre-training and allows fine-grained control over the source video, enabling diverse applications.

The survey reviews image editing techniques that underpin video editing, including latent state initialization, attention feature injection, and text inversion. It also covers efficient adaptations such as Low-Rank Adaptation (LoRA) and Token Merging (ToMe), which improve the efficiency of adapting diffusion models to video editing tasks.

In video generation, diffusion models have been extended to create videos from scratch, sharing much of their underlying technology with video editing; recent advances have accelerated progress and surpassed earlier GAN-based approaches. The survey reviews key technologies in video generation and introduces optical flow, a motion representation widely used in video-related tasks. On the editing side, it discusses approaches including temporal adaptation, structure conditioning, and training modification, which collectively reflect the evolving effectiveness and complexity of video editing tasks.
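As background for the "mathematical formulation" the survey reviews, the standard denoising diffusion (DDPM) setup can be sketched as follows; this is the generic textbook formulation, not notation taken from the survey itself. A forward process gradually corrupts data with Gaussian noise, and a network is trained to reverse it:

\[
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right),
\]

with \(\alpha_t = 1-\beta_t\) and \(\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s\). A noise-prediction network \(\epsilon_\theta\) is trained with the simplified objective

\[
\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0,\mathbf{I}),\ t}\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\rVert^2\right],
\]

and generation runs the learned reverse process \(p_\theta(x_{t-1} \mid x_t)\) from pure noise \(x_T \sim \mathcal{N}(0,\mathbf{I})\) back to a clean sample \(x_0\).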
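Likewise, Low-Rank Adaptation (LoRA), mentioned above among the efficient adaptations, can be summarized by a single equation; this is the generic LoRA formulation rather than the survey's specific usage. A frozen pre-trained weight matrix \(W_0\) is augmented with a trainable low-rank update:

\[
W = W_0 + \Delta W = W_0 + \frac{\alpha}{r}\, B A,
\qquad
B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k),
\]

so only \(A\) and \(B\) are trained while \(W_0\) stays fixed, which is why LoRA is attractive for adapting pre-trained T2I models to video editing without full fine-tuning.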