Video Diffusion Models are Training-free Motion Interpreter and Controller

23 May 2024 | Zeqi Xiao¹, Yifan Zhou¹, Shuai Yang², Xingang Pan¹*
This paper introduces a novel perspective to understand, localize, and manipulate motion-aware features in video diffusion models. The authors analyze how video diffusion models encode cross-frame motion information and present MOtion FeaTure (MOFT), a method that extracts motion information from pre-trained video diffusion models. MOFT is designed to remove content correlation information and filter motion channels, providing a distinct set of benefits, including high interpretability, training-free extraction, and generalizability across diverse architectures. Leveraging MOFT, the authors propose a training-free video motion control framework that demonstrates competitive performance in generating natural and faithful motion. The method is shown to be effective in various downstream tasks and is applicable to different video generation models without the need for independent training. The paper also includes experiments and qualitative results to validate the effectiveness of MOFT and the proposed motion control framework.
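To make the two operations described above concrete, here is a minimal sketch of a MOFT-style extraction. It assumes access to intermediate features from a pre-trained video diffusion model, shaped (frames, channels, height, width); the content-correlation removal (subtracting the temporal mean) and the variance-based channel filtering shown here are illustrative assumptions based on the summary, not the authors' exact procedure.

```python
import torch


def extract_moft(features: torch.Tensor, top_k: int = 64) -> torch.Tensor:
    """Sketch of a MOFT-style motion feature extraction.

    features: (frames, channels, height, width) tensor taken from an
              intermediate block of a pre-trained video diffusion model.
    top_k:    number of motion-aware channels to keep (illustrative value).
    """
    # Remove content correlation: subtract the per-pixel mean over frames,
    # so the remaining signal emphasizes cross-frame (motion) variation.
    content = features.mean(dim=0, keepdim=True)             # (1, C, H, W)
    debiased = features - content                            # (F, C, H, W)

    # Filter motion channels: keep channels whose activations vary most
    # across frames, a simple proxy for "motion-aware" channels.
    per_channel_var = debiased.var(dim=0).mean(dim=(1, 2))   # (C,)
    motion_channels = per_channel_var.topk(top_k).indices    # (top_k,)

    return debiased[:, motion_channels]                      # (F, top_k, H, W)
```

Because both steps operate directly on features of an existing model, the extraction requires no additional training, which is consistent with the training-free property the paper emphasizes.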