31 Jan 2024 | Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Jifeng Dai, Hongsheng Li
Motion-I2V is a framework for consistent and controllable image-to-video (I2V) generation. It addresses the limitations of existing methods by factorizing the I2V process into two stages with explicit motion modeling. In the first stage, a diffusion-based motion field predictor, which combines a pre-trained diffusion model with a ControlNet-style conditioning branch, infers pixel-wise trajectories of the reference image. In the second stage, this motion field guides the synthesis of consistent video frames through motion-augmented temporal attention, which enhances the 1-D temporal attention commonly used in video diffusion models. Because motion is modeled explicitly, the framework supports sparse trajectory control and a motion brush for region-specific animation, and the second stage alone enables zero-shot video-to-video translation. Trained on a large-scale text-video dataset and evaluated on multiple benchmarks, Motion-I2V outperforms existing methods in temporal consistency and controllability, particularly under large motions and viewpoint changes, while producing visually appealing videos.
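A minimal sketch of the motion-augmented temporal attention idea, assuming the stage-1 motion fields are dense flows from the reference frame to every target frame. The module names, tensor layouts, and the exact way warped features enter the attention are illustrative assumptions, not the authors' released implementation: reference-frame features are backward-warped along the predicted flow and appended as extra key/value tokens for an ordinary 1-D temporal attention over frames.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def warp_with_flow(ref_feat, flow):
    """Backward-warp reference features [B, C, H, W] with a flow field [B, 2, H, W]."""
    B, _, H, W = ref_feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=ref_feat.device, dtype=ref_feat.dtype),
        torch.arange(W, device=ref_feat.device, dtype=ref_feat.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow        # [B, 2, H, W]
    # Normalize pixel coordinates to [-1, 1] as expected by grid_sample.
    grid_x = 2.0 * grid[:, 0] / max(W - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                   # [B, H, W, 2]
    return F.grid_sample(ref_feat, grid, align_corners=True)


class MotionAugmentedTemporalAttention(nn.Module):
    """1-D temporal attention whose key/value set is augmented with
    motion-warped reference-frame features (hypothetical layout)."""

    def __init__(self, channels, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feats, flows):
        # feats: per-frame latents [B, T, C, H, W]
        # flows: predicted motion fields from the reference (first) frame
        #        to frame t, shape [B, T, 2, H, W]
        B, T, C, H, W = feats.shape
        ref = feats[:, 0]
        warped = torch.stack(
            [warp_with_flow(ref, flows[:, t]) for t in range(T)], dim=1
        )                                                          # [B, T, C, H, W]

        # Treat each spatial location as an independent temporal sequence.
        q = feats.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
        extra = warped.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
        kv = torch.cat([q, extra], dim=1)                          # [B*H*W, 2T, C]

        out, _ = self.attn(self.norm(q), self.norm(kv), self.norm(kv))
        out = (q + out).reshape(B, H, W, T, C).permute(0, 3, 4, 1, 2)
        return out                                                 # [B, T, C, H, W]
```

The design intuition this sketch tries to capture is that warping the reference features along the predicted trajectories enlarges the temporal receptive field beyond same-location 1-D attention, which is what lets the second stage stay consistent under large motions.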