Object-Centric Diffusion for Efficient Video Editing


30 Aug 2024 | Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Fatih Porikli, Yuki M. Asano, and Amirhossein Habibian
This paper introduces Object-Centric Diffusion (OCD), a method for making diffusion-based video editing more efficient. Diffusion models achieve impressive editing quality, but they are computationally expensive due to their iterative sampling process and cross-frame attention mechanisms. OCD addresses these inefficiencies by concentrating computation on foreground regions, which matter most for perceptual quality.

OCD introduces two key techniques: Object-Centric Sampling and Object-Centric Token Merging. Object-Centric Sampling decouples the diffusion process, allocating more denoising steps to foreground regions and fewer to the background. Object-Centric Token Merging reduces the number of cross-frame attention tokens by merging redundant tokens in background regions. Together, these techniques significantly reduce memory and compute costs without compromising quality.

The paper evaluates OCD on inversion-based and control-signal-based editing pipelines, achieving up to 10× faster editing with comparable quality. The results show that OCD preserves quality in both foreground and background regions while drastically reducing generation cost, and the method can be applied to various video editing models without retraining.

The paper also analyzes the efficiency bottlenecks in video editing, including memory operations and cross-frame attention, and explores off-the-shelf acceleration techniques such as faster noise schedulers and token merging, which can be combined with OCD for further gains. OCD is validated on two video editing models, FateZero and ControlVideo, where it yields significant speedups and lower memory consumption. The method is robust to the choice of saliency mask and works well for both small and large foreground objects; however, it is less effective for global editing tasks and requires careful hyperparameter tuning in zero-shot video editing.
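To make the two techniques concrete, here is a minimal NumPy sketch of the ideas, not the paper's implementation: `denoise_step` is a hypothetical stand-in for one U-Net denoising step, the sampling function runs more steps on the foreground than on the background before compositing with the saliency mask, and the token-merging helper averages the most similar background token pairs (bipartite matching in the spirit of ToMe) while leaving foreground tokens untouched. All function names and step counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x, t):
    """Hypothetical stand-in for one U-Net denoising step at noise level t."""
    return x * (1.0 - 0.1 * t)

def object_centric_sampling(latents, fg_mask, fg_steps=50, bg_steps=10):
    """Run many denoising steps on the foreground, few on the background,
    then composite the two results with the saliency mask."""
    fg, bg = latents.copy(), latents.copy()
    for t in np.linspace(1.0, 0.0, fg_steps):   # fine schedule (foreground)
        fg = denoise_step(fg, t)
    for t in np.linspace(1.0, 0.0, bg_steps):   # coarse schedule (background)
        bg = denoise_step(bg, t)
    return fg_mask * fg + (1.0 - fg_mask) * bg

def merge_background_tokens(tokens, fg_idx, bg_idx, merge_frac=0.5):
    """Merge the most similar background token pairs; keep foreground tokens."""
    bg = tokens[bg_idx]
    a, b = bg[0::2], bg[1::2]                   # bipartite split of bg tokens
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = an @ bn.T                             # cosine similarity a -> b
    match, score = sim.argmax(axis=1), sim.max(axis=1)
    n_merge = int(merge_frac * len(a))
    to_merge = np.argsort(-score)[:n_merge]     # most redundant a-tokens
    keep_a = np.setdiff1d(np.arange(len(a)), to_merge)
    merged = b.copy()
    for i in to_merge:                          # average each a-token into its match
        merged[match[i]] = (merged[match[i]] + a[i]) / 2.0
    return np.concatenate([tokens[fg_idx], a[keep_a], merged], axis=0)
```

With `merge_frac=0.5`, half of one background partition is folded into its nearest neighbors, so cross-frame attention operates on fewer tokens; the real method would also track merge assignments to unmerge tokens afterward, which this sketch omits.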
The paper concludes that OCD substantially improves the efficiency of diffusion-based video editing, making it a valuable tool for real-time video editing applications.