DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing


15 Jul 2024 | Hyeonho Jeong, Jinho Chang, Geon Yeong Park, and Jong Chul Ye
**Institution:** Kim Jaechul Graduate School of AI and Dept. of Bio and Brain Engineering, KAIST, South Korea
**Project Page:** https://hyeonho99.github.io/dreammotion

**Abstract:** Text-driven diffusion-based video editing faces the challenge of achieving real-world motion. Unlike existing methods, DreamMotion adopts score distillation sampling, bypassing the standard reverse diffusion process and initiating optimization from a video that already exhibits natural motion. While video score distillation can effectively introduce new content, it can also cause significant structure and motion deviations. To address this, DreamMotion matches the space-time self-similarities of the original and edited videos during score distillation. The approach is model-agnostic and applies to both cascaded and non-cascaded video diffusion frameworks. Extensive comparisons with leading methods demonstrate its superiority in altering appearance while preserving the original structure and motion.

**Introduction:** Text-driven diffusion models have revolutionized image editing, but extending them to video introduces the added challenge of temporally consistent, real-world motion. Existing methods typically guide the reverse diffusion process with inflated attention layers or visual hints, yet they struggle to achieve smooth and complete motion. DreamMotion diverges from these approaches by editing through score distillation sampling, starting from an input video with natural motion: Delta Denoising Score (DDS) gradients modify the appearance while maintaining the integrity of the motion. Because naive score distillation accumulates structural errors, DreamMotion introduces space-time regularization that aligns the spatial and temporal self-similarities of the original and edited videos.

**Background:** The paper reviews diffusion models, conditional generation, and video diffusion models, providing the foundation for the proposed method.

**DreamMotion:** DreamMotion is a zero-shot framework that distills video scores from text-to-video diffusion priors to inject the target appearance. It comprises three key components: appearance injection using DDS, structure correction through spatial self-similarity matching, and temporal smoothing via temporal self-similarity matching. The method is instantiated on both cascaded and non-cascaded video diffusion frameworks; a minimal sketch of the three losses is given below.
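To make the three components concrete, here is a minimal PyTorch-style sketch of one optimization step. This is not the authors' released implementation: the noise predictor `eps_theta`, the feature extractor `extract_features`, and the loss weights are hypothetical placeholders standing in for the paper's text-to-video diffusion prior and its intermediate features.

```python
import torch
import torch.nn.functional as F

def self_sim(tokens):
    """Cosine self-similarity of a token set: (B, N, C) -> (B, N, N)."""
    t = F.normalize(tokens, dim=-1)
    return t @ t.transpose(-1, -2)

def dreammotion_step(z_edit, z_orig, eps_theta, extract_features,
                     src_prompt, tgt_prompt, alphas_cumprod,
                     lam_spatial=1.0, lam_temporal=1.0, lr=0.1):
    """One hedged optimization step on video latents of shape (F, C, H, W).
    z_edit is a leaf tensor with requires_grad=True; z_orig is fixed."""
    t = torch.randint(20, 980, (1,), device=z_edit.device)
    a = alphas_cumprod[t].view(1, 1, 1, 1)
    noise = torch.randn_like(z_edit)  # shared noise and timestep for both branches
    zt_edit = a.sqrt() * z_edit + (1 - a).sqrt() * noise
    zt_orig = a.sqrt() * z_orig + (1 - a).sqrt() * noise

    with torch.no_grad():
        # Delta Denoising Score: the difference of the two noise predictions
        # points from the source appearance toward the target appearance.
        dds_grad = (eps_theta(zt_edit, t, tgt_prompt)
                    - eps_theta(zt_orig, t, src_prompt))

    # Self-similarity alignment on per-frame feature tokens, shape (F, N, C).
    feat_e = extract_features(z_edit)
    with torch.no_grad():
        feat_o = extract_features(z_orig)
    # Spatial: token-to-token similarity within each frame.
    loss_sp = (self_sim(feat_e) - self_sim(feat_o)).abs().mean()
    # Temporal: each spatial location's similarity across frames, shape (N, F, F).
    loss_tm = (self_sim(feat_e.transpose(0, 1)) -
               self_sim(feat_o.transpose(0, 1))).abs().mean()

    reg = lam_spatial * loss_sp + lam_temporal * loss_tm
    reg_grad = torch.autograd.grad(reg, z_edit)[0]
    with torch.no_grad():
        z_edit -= lr * (dds_grad + reg_grad)  # gradient step on the edited video
    return z_edit
```

As in DDS, using the same noise and timestep for both branches cancels much of the noisy gradient component that plagues plain score distillation, while the two self-similarity terms penalize only relational drift in structure and motion rather than raw pixel differences.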
**Experiments:** The paper evaluates DreamMotion on the DAVIS and WebVid datasets, comparing it against leading baselines both qualitatively and quantitatively. DreamMotion produces temporally consistent videos that closely adhere to the target prompt while preserving the motion of the input video. Quantitative metrics and user studies further confirm its superior textual alignment, frame consistency, and structure and motion preservation (a sketch of typical CLIP-based versions of these metrics follows the Conclusion).

**Conclusion:** DreamMotion addresses the challenge of diffusion-based video editing by combining score distillation with space-time self-similarity alignment. It demonstrates superior performance in injecting the target appearance while maintaining the structure and motion of the input video.
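The textual-alignment and frame-consistency numbers mentioned in the Experiments section are commonly computed from CLIP embeddings. The following is a hedged sketch of such metrics using the Hugging Face `transformers` CLIP API; it is a standard recipe, not necessarily the authors' exact evaluation protocol.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_video_scores(frames, prompt):
    """frames: list of PIL frames (>= 2); returns (textual_alignment, frame_consistency)."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    textual_alignment = (img @ txt.T).mean().item()        # mean frame-prompt similarity
    frame_consistency = (img[:-1] * img[1:]).sum(-1).mean().item()  # adjacent frames
    return textual_alignment, frame_consistency
```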