Video Editing via Factorized Diffusion Distillation

24 Mar 2024 | Uriel Singer*, Amit Zohar*, Yuval Kirstain, Shelly Sheynin, Adam Polyak, Devi Parikh, and Yaniv Taigman
The paper introduces Emu Video Edit (EVE), a state-of-the-art video editing model that does not rely on supervised video editing data. The key insight behind EVE is that video editing can be decomposed into two capabilities: precise editing of individual frames and temporal consistency among the edited frames. To achieve this, the authors train two separate adapters, an image editing adapter and a video generation adapter, on top of a frozen text-to-image model. These adapters are then aligned using a new unsupervised distillation procedure called Factorized Diffusion Distillation (FDD). In FDD, the student generates an edited video and simultaneously receives supervision from one or more teachers through Score Distillation Sampling and adversarial losses.

The resulting model achieves state-of-the-art results on the Text-Guided Video Editing (TGVE) benchmark, and the authors introduce additional automatic metrics for measuring temporal consistency. The paper also demonstrates the generality of the approach by aligning additional combinations of adapters for personalized and stylized image editing. Limitations include performance being upper-bounded by the capabilities of the teacher models, and the requirement for pre-trained adapters during training.
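The paper does not include code, but the shape of an FDD training step can be sketched. Below is a minimal, illustrative PyTorch sketch, not the authors' implementation: `TinyDenoiser`, `StudentEditor`, `Discriminator`, `sds_loss`, and `fdd_step` are all hypothetical stand-ins, the diffusion process is collapsed to a single sampled noise level, and text conditioning, the multi-step student sampling, and discriminator training are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# All modules below are toy stand-ins; the real EVE is a latent diffusion
# U-Net with adapters, text conditioning, and k-step student sampling.

class TinyDenoiser(nn.Module):
    """Frozen diffusion-teacher stand-in: predicts noise from noised latents."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4, 4, kernel_size=3, padding=1)

    def forward(self, x):
        return self.net(x)

class StudentEditor(nn.Module):
    """Stand-in for the frozen T2I backbone plus both (trainable) adapters."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4, 4, kernel_size=3, padding=1)

    def forward(self, video):  # video: (batch, frames, channels, h, w)
        b, f, c, h, w = video.shape
        out = self.net(video.reshape(b * f, c, h, w))
        return out.reshape(b, f, c, h, w)

class Discriminator(nn.Module):
    """Adversarial head used alongside SDS supervision in FDD."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 8, kernel_size=3, padding=1),
            nn.Flatten(),
            nn.LazyLinear(1),
        )

    def forward(self, x):
        return self.net(x)

def sds_loss(teacher, latents):
    """Simplified Score Distillation Sampling: the gradient w.r.t. the latents
    is (eps_pred - eps), injected via the standard detach trick."""
    noise = torch.randn_like(latents)
    sigma = torch.rand(latents.shape[0], 1, 1, 1)
    noised = latents + sigma * noise  # toy forward-diffusion step
    with torch.no_grad():
        eps_pred = teacher(noised)
    grad = eps_pred - noise
    return 0.5 * F.mse_loss(latents, (latents - grad).detach())

def fdd_step(student, img_teacher, vid_teacher, img_disc, vid_disc, video, opt):
    """One simplified FDD student update: the student edits the video, then
    both teachers supervise the result with SDS and adversarial losses."""
    edited = student(video)
    b, f, c, h, w = edited.shape
    frames = edited.reshape(b * f, c, h, w)

    # The image-editing teacher scores each frame; the video teacher would
    # score the whole clip (flattened here so the toy conv teacher runs).
    loss = sds_loss(img_teacher, frames) + sds_loss(vid_teacher, frames)

    # Non-saturating adversarial losses; discriminator updates are omitted
    # (the actual procedure alternates student and discriminator steps).
    loss = loss + F.softplus(-img_disc(frames)).mean()
    loss = loss + F.softplus(-vid_disc(frames)).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

if __name__ == "__main__":
    student = StudentEditor()
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    video = torch.randn(2, 8, 4, 32, 32)  # (batch, frames, latent ch, h, w)
    print(fdd_step(student, TinyDenoiser(), TinyDenoiser(),
                   Discriminator(), Discriminator(), video, opt))
```

The sketch is only meant to make the data flow visible: the student edits the full video in one pass, and supervision then factors into a per-frame editing signal and a cross-frame consistency signal, each delivered through both an SDS term and an adversarial term.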