Object-Centric Diffusion for Efficient Video Editing


30 Aug 2024 | Kumara Kahatapitiya*, Adil Karjauv, Davide Abati, Fatih Porikli, Yuki M. Asano, and Amirhossein Habibian
The paper "Object-Centric Diffusion for Efficient Video Editing" by Kumara Kahatapitiya et al. addresses the inefficiencies in diffusion-based video editing models, particularly in terms of memory and computational costs. The authors identify key bottlenecks such as attention-based guidance and cross-frame attention, which significantly increase latency. They propose two main techniques: Object-Centric Sampling and Object-Centric Token Merging (ToMe). Object-Centric Sampling decouples the diffusion process into foreground and background regions, focusing most of the denoising steps on the foreground. Object-Centric ToMe reduces the cost of cross-frame attention by merging redundant tokens in background regions. These techniques are applied to inversion-based and ControlNet-based video editing models, achieving up to 10× speedup in latency while maintaining comparable quality. The paper also includes a detailed analysis of the impact of these techniques on different types of foreground objects and provides qualitative and quantitative evaluations to demonstrate their effectiveness.The paper "Object-Centric Diffusion for Efficient Video Editing" by Kumara Kahatapitiya et al. addresses the inefficiencies in diffusion-based video editing models, particularly in terms of memory and computational costs. The authors identify key bottlenecks such as attention-based guidance and cross-frame attention, which significantly increase latency. They propose two main techniques: Object-Centric Sampling and Object-Centric Token Merging (ToMe). Object-Centric Sampling decouples the diffusion process into foreground and background regions, focusing most of the denoising steps on the foreground. Object-Centric ToMe reduces the cost of cross-frame attention by merging redundant tokens in background regions. These techniques are applied to inversion-based and ControlNet-based video editing models, achieving up to 10× speedup in latency while maintaining comparable quality. The paper also includes a detailed analysis of the impact of these techniques on different types of foreground objects and provides qualitative and quantitative evaluations to demonstrate their effectiveness.