Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

31 Jan 2024 | Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Jifeng Dai, Hongsheng Li
Motion-I2V is a framework for consistent and controllable image-to-video (I2V) generation. It addresses the limitations of existing methods by factorizing the I2V generation process into two stages with explicit motion modeling. In the first stage, a diffusion-based motion field predictor infers pixel-wise trajectories of the reference image. In the second stage, these predicted motion fields guide the synthesis of consistent video frames through motion-augmented temporal attention, which enhances the limited 1-D temporal attention of video diffusion models. The framework also supports sparse trajectory control and a motion brush for precise, region-specific animation, and the second stage enables zero-shot video-to-video translation. Trained on a large-scale text-video dataset and using a combination of diffusion models and ControlNet for motion prediction, Motion-I2V outperforms existing methods in consistency and controllability, especially under large motions and viewpoint changes, and is shown across benchmarks to generate temporally consistent, visually appealing videos.
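To make the second-stage mechanism concrete, the sketch below illustrates one plausible form of motion-augmented temporal attention: reference-image features are backward-warped along the predicted motion fields and supplied as additional keys/values to the 1-D temporal attention. This is a minimal illustration, not the authors' implementation; the function names, tensor shapes, and single-head attention are assumptions made for clarity.

```python
# Minimal sketch (assumed, not the paper's code) of motion-augmented temporal
# attention: warp reference-image features with the predicted motion fields,
# then let 1-D temporal attention attend to both per-frame and warped features.
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Backward-warp features (B, C, H, W) with a motion field (B, 2, H, W)."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)      # (2, H, W)
    coords = grid.unsqueeze(0) + flow                                  # (B, 2, H, W)
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    cx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    cy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((cx, cy), dim=-1)                          # (B, H, W, 2)
    return F.grid_sample(feat, grid_norm, align_corners=True)

def motion_augmented_temporal_attention(frame_feats, ref_feat, flows):
    """
    frame_feats: (B, T, C, H, W) per-frame latent features
    ref_feat:    (B, C, H, W)    reference-image features
    flows:       (B, T, 2, H, W) predicted motion fields (reference -> frame t)
    Returns (B, T, C, H, W) after temporal attention that also attends to the
    motion-warped reference features (illustrative single-head attention).
    """
    b, t, c, h, w = frame_feats.shape
    warped_ref = torch.stack([warp(ref_feat, flows[:, i]) for i in range(t)], dim=1)

    # Queries: per-frame features at each spatial location, attended over time.
    q = frame_feats.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
    # Keys/values: per-frame features plus the warped reference features.
    kv = torch.cat([frame_feats, warped_ref], dim=1)                   # (B, 2T, C, H, W)
    kv = kv.permute(0, 3, 4, 1, 2).reshape(b * h * w, 2 * t, c)

    attn = torch.softmax(q @ kv.transpose(1, 2) / c ** 0.5, dim=-1)    # (BHW, T, 2T)
    out = attn @ kv                                                    # (BHW, T, C)
    return out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
```

The point of warping before attention is that each frame's query can look up reference-image content already displaced to where the predicted motion says it should be, which is one way to keep generated frames consistent with the reference even under large motions.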