Video Editing via Factorized Diffusion Distillation


24 Mar 2024 | Uriel Singer*, Amit Zohar*, Yuval Kirstain, Shelly Sheynin, Adam Polyak, Devi Parikh, and Yaniv Taigman
This paper introduces Emu Video Edit (EVE), a video editing model that achieves state-of-the-art results without relying on any supervised video editing data. EVE combines two separate adapters, an image editing adapter and a video generation adapter, both trained on top of a shared text-to-image model. The adapters are aligned with a new unsupervised distillation procedure, Factorized Diffusion Distillation (FDD), which distills knowledge from multiple teachers so that the combined model edits each frame precisely while keeping the edited video temporally consistent.

EVE demonstrates strong performance on the Text Guided Video Editing (TGVE) benchmark, and the authors extend the evaluation protocol with additional metrics and tasks. The method is evaluated on a range of tasks, including zero-shot video editing, and shows significant improvements over existing methods. Because FDD only requires pre-trained adapters, the same recipe can be used to align other combinations of adapters and to support additional video editing capabilities. The paper also discusses limitations of the approach, including its reliance on pre-trained adapters and the need for careful alignment. Overall, the study presents an effective approach to video editing that can be applied to a wide range of tasks and models.
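To make the training setup concrete, below is a minimal, hypothetical PyTorch sketch of the FDD idea: a student that stacks both adapters on a frozen shared backbone is aligned against two frozen teachers (the backbone plus one adapter each) using unlabeled videos. All module names, shapes, and the simple regression loss are assumptions for illustration; the paper's actual objective combines distillation and adversarial terms, so this is not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class Adapter(nn.Module):
    """Lightweight residual adapter applied on top of backbone features (illustrative)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, h):
        return h + self.proj(h)


class Student(nn.Module):
    """Shared backbone with the editing and video adapters stacked; only the adapters train."""
    def __init__(self, backbone, edit_adapter, video_adapter):
        super().__init__()
        self.backbone = backbone
        self.edit_adapter = edit_adapter
        self.video_adapter = video_adapter

    def forward(self, x):
        with torch.no_grad():  # the shared text-to-image backbone stays frozen
            h = self.backbone(x)
        return self.video_adapter(self.edit_adapter(h))


def fdd_step(student, edit_teacher, video_teacher, frames, optimizer,
             w_edit=1.0, w_video=1.0):
    """One simplified alignment step: match each frozen teacher on unlabeled frames."""
    optimizer.zero_grad()
    pred = student(frames)
    with torch.no_grad():
        edit_target = edit_teacher(frames)    # per-frame editing knowledge
        video_target = video_teacher(frames)  # temporal-consistency knowledge
    # Simple regression stand-in for the paper's distillation + adversarial losses.
    loss = w_edit * F.mse_loss(pred, edit_target) + w_video * F.mse_loss(pred, video_target)
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage with stand-in modules (the real models would be diffusion networks).
dim = 64
backbone = nn.Linear(dim, dim)
edit_adapter, video_adapter = Adapter(dim), Adapter(dim)
student = Student(backbone, edit_adapter, video_adapter)
edit_teacher = nn.Sequential(backbone, Adapter(dim)).eval()   # backbone + editing adapter
video_teacher = nn.Sequential(backbone, Adapter(dim)).eval()  # backbone + video adapter
optimizer = torch.optim.Adam(
    list(edit_adapter.parameters()) + list(video_adapter.parameters()), lr=1e-4)
frames = torch.randn(8, dim)  # unlabeled video frames, flattened to toy features
print(fdd_step(student, edit_teacher, video_teacher, frames, optimizer))

The key design point mirrored here is that only the two adapters receive gradients while the shared backbone and both teachers remain frozen, so alignment requires no supervised video editing data, only unlabeled videos and the pre-trained adapters.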