Video Diffusion Models are Training-free Motion Interpreter and Controller

23 May 2024 | Zeqi Xiao, Yifan Zhou, Shuai Yang, Xingang Pan
This paper introduces a novel training-free motion feature (MOFT) for video diffusion models, enabling interpretable and controllable motion in video generation. MOFT is derived by removing content correlation information from video diffusion features and filtering the channels that carry motion. Through Principal Component Analysis (PCA), the study shows that robust motion-aware features already exist in pretrained video diffusion models. MOFT offers clear interpretability, can be extracted without any training, and generalizes across different architectures.

Building on MOFT, the authors develop a training-free video motion control framework that produces natural and faithful motion with competitive performance. Unlike previous training-based methods, the approach applies to different architectures and checkpoints without independent training. The framework leverages compositional loss functions for content manipulation and enables point-drag manipulation via MOFT guidance. Experiments show that MOFT achieves high motion fidelity and image quality, outperforming some data-driven methods. The method is versatile, applicable to various video generation models, and yields architecture-agnostic insights. The study also highlights the importance of understanding how motion is encoded in video diffusion models for improving motion control and downstream applications. A minimal sketch of the core idea appears below.
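The sketch below is a hypothetical, simplified illustration (not the paper's implementation) of the two ideas summarized above: removing content correlation from per-frame diffusion features by subtracting their temporal mean, and keeping only motion-salient channels (e.g., selected offline via PCA), plus a masked guidance loss of the kind used for point-drag style manipulation. The function names, tensor shapes, and the channel-selection strategy are assumptions for illustration only.

```python
import torch

def extract_moft(features, motion_channels=None):
    """Hypothetical MOFT-style extraction from video diffusion features.

    features: (F, C, H, W) intermediate features from a video diffusion
              U-Net, one slice per frame (F frames, C channels).
    motion_channels: optional 1-D index tensor of channels judged to be
              motion-aware (e.g., chosen by inspecting PCA components of
              the temporal variation); if None, all channels are kept.
    """
    # Remove content correlation: subtracting the temporal mean leaves only
    # frame-to-frame variation, which reflects motion rather than appearance.
    content = features.mean(dim=0, keepdim=True)   # (1, C, H, W)
    moft = features - content                      # (F, C, H, W)

    # Filter motion channels: keep only the motion-aware subset if provided.
    if motion_channels is not None:
        moft = moft[:, motion_channels]
    return moft


def moft_guidance_loss(moft_current, moft_reference, mask):
    """Hypothetical guidance loss: pull the MOFT inside a masked region toward
    a reference MOFT (e.g., taken from the drag target location), so that
    gradients through the diffusion latent steer the generated motion."""
    diff = (moft_current - moft_reference) * mask   # restrict to edited region
    return diff.pow(2).mean()


# Usage sketch: features would come from a hooked layer of a video diffusion
# model during sampling; here random tensors stand in for them.
if __name__ == "__main__":
    feats = torch.randn(16, 320, 32, 32)            # 16 frames, 320 channels
    moft = extract_moft(feats, motion_channels=torch.arange(0, 64))
    loss = moft_guidance_loss(moft, torch.zeros_like(moft),
                              torch.ones(1, 1, 32, 32))
    print(moft.shape, loss.item())
```

In a guidance-based setup, a loss of this form would be evaluated at selected denoising steps and its gradient with respect to the latent used to nudge the sample, which is consistent with the training-free control framework described above, though the exact losses and layers used in the paper may differ.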