31 Jan 2024
**Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling**
**Authors:** Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, Hongsheng Li
**Institutional Affiliations:** The Chinese University of Hong Kong, NVIDIA AI Technology Center, SenseTime Research, Tsinghua University, Centre for Perceptual and Interactive Intelligence (CPII), Shanghai AI Laboratory
**Abstract:**
Motion-I2V is a novel framework for consistent and controllable image-to-video generation (I2V). Unlike previous methods that directly learn the complex image-to-video mapping, Motion-I2V factors I2V into two stages with explicit motion modeling. The first stage uses a diffusion-based motion field predictor to deduce the trajectories of reference image pixels. The second stage employs motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models, effectively propagating reference image features to synthesized frames guided by predicted trajectories. This approach ensures more consistent videos, even with large motion and viewpoint changes. Motion-I2V also supports precise control over motion trajectories and regions using sparse trajectory annotations and region-specific animations. Additionally, it naturally supports zero-shot video-to-video translation. Qualitative and quantitative comparisons demonstrate Motion-I2V's superior performance in consistent and controllable I2V generation.
**Introduction:**
Image-to-video generation (I2V) aims to animate a given image into a video with natural dynamics while preserving visual appearance. Traditional I2V methods are often specialized for specific categories, limiting their utility in diverse, open-domain scenarios. Recent advancements in diffusion models have shown promise in producing high-quality and diverse images, but they struggle with temporal consistency and controllability in I2V tasks. Motion-I2V addresses these issues by decoupling motion modeling and video detail generation, enhancing temporal receptive fields, and providing fine-grained control over the I2V process.
**Methods:**
- **Latent Diffusion Model (LDM):** A backbone generative model that conducts denoising in the latent space of a Variational Autoencoder (VAE).
- **Video Latent Diffusion Model (VLDM):** Extends the LDM by incorporating temporal modules to capture temporal dependencies.
- **Motion Prediction with Video Diffusion Models:** Adapts a pre-trained video diffusion model to predict motion fields conditioned on the reference image and text prompt.
- **Video Rendering with Predicted Motion:** Enhances the vanilla 1-D temporal attention with motion-augmented temporal attention, which propagates reference image features to the synthesized frames along the predicted motion fields (see the sketch after this list).
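The distinctive piece of the second stage is temporal attention that is informed by the predicted motion. Below is a minimal PyTorch sketch of one plausible realization: reference-frame features are backward-warped to each target frame along the predicted flow, and every spatial location then attends over its own per-frame features together with the warped reference features. The tensor shapes, the single unprojected attention head, and the helper names (`warp_to_targets`, `motion_augmented_temporal_attention`) are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of motion-augmented temporal attention (simplified, not the
# paper's exact implementation): warp reference features to each frame
# with the predicted flow, then let temporal attention read from them.
import torch
import torch.nn.functional as F


def warp_to_targets(ref_feat: torch.Tensor, flows: torch.Tensor) -> torch.Tensor:
    """Backward-warp reference-frame features to each target frame.

    ref_feat: (C, H, W) features of the reference frame.
    flows:    (T, 2, H, W) per-pixel (dx, dy) displacement from each target
              pixel back to its source location in the reference frame.
    Returns:  (T, C, H, W) reference features aligned to every frame.
    """
    T, _, H, W = flows.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=flows.dtype, device=flows.device),
        torch.arange(W, dtype=flows.dtype, device=flows.device),
        indexing="ij",
    )
    coords = torch.stack((xs, ys), dim=0).unsqueeze(0) + flows  # (T, 2, H, W)
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects.
    grid = torch.stack(
        (2.0 * coords[:, 0] / (W - 1) - 1.0,
         2.0 * coords[:, 1] / (H - 1) - 1.0),
        dim=-1,
    )  # (T, H, W, 2)
    ref = ref_feat.unsqueeze(0).expand(T, -1, -1, -1)
    return F.grid_sample(ref, grid, align_corners=True)


def motion_augmented_temporal_attention(feats, flows):
    """Each location attends over its own frame features plus the warped
    reference features, instead of a plain 1-D temporal window.

    feats: (T, C, H, W) per-frame latents; flows: (T, 2, H, W).
    Learned q/k/v projections are omitted for brevity.
    """
    T, C, H, W = feats.shape
    warped_ref = warp_to_targets(feats[0], flows)                    # (T, C, H, W)
    q = feats.permute(2, 3, 0, 1).reshape(H * W, T, C)               # queries
    kv = torch.stack((feats, warped_ref), dim=1)                     # (T, 2, C, H, W)
    kv = kv.permute(3, 4, 0, 1, 2).reshape(H * W, 2 * T, C)          # keys/values
    attn = torch.softmax(q @ kv.transpose(1, 2) / C ** 0.5, dim=-1)  # (HW, T, 2T)
    out = attn @ kv                                                  # (HW, T, C)
    return out.permute(1, 2, 0).reshape(T, C, H, W)
```

Because the warped reference features are spatially aligned with each synthesized frame, even a narrow temporal attention window can copy appearance from the reference image across large displacements, which is the consistency benefit the paper attributes to this design.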
**Control and Applications:**
- **Sparse Trajectory Guided I2V:** Allows users to specify desired motions with sparse trajectory annotations (see the encoding sketch after this list).
- **Region-Specific I2V:** Animates only user-specified regions of the reference image, keeping the remaining areas static.
- **Zero-Shot Video-to-Video Translation:** Naturally supports translating an input video into a new appearance without additional training.
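To condition generation on user input, sparse trajectories must be turned into a dense tensor a motion predictor can consume. The sketch below shows one plausible encoding, assuming each trajectory arrives as a per-frame (x, y) point list: it is rasterized into a sparse flow map plus a validity mask at its reference-frame location. The function name and tensor layout are illustrative; the paper's exact conditioning format may differ.

```python
# Sketch of encoding user-drawn sparse trajectories as a flow map + mask
# that could condition a motion predictor (layout is an assumption).
import numpy as np


def encode_sparse_trajectories(trajectories, num_frames, height, width):
    """Convert sparse trajectories into per-frame flow maps and masks.

    trajectories: list of np.ndarray of shape (num_frames, 2), each holding
                  the (x, y) position of one user-selected point per frame.
    Returns: flow  (num_frames, 2, H, W) displacement from frame 0,
             mask  (num_frames, 1, H, W) 1 where a trajectory is defined.
    """
    flow = np.zeros((num_frames, 2, height, width), dtype=np.float32)
    mask = np.zeros((num_frames, 1, height, width), dtype=np.float32)
    for traj in trajectories:
        x0, y0 = traj[0]
        xi, yi = int(round(x0)), int(round(y0))
        if not (0 <= xi < width and 0 <= yi < height):
            continue  # starting point falls outside the image
        for t in range(num_frames):
            # Displacement of this point from the reference frame to frame t,
            # written at the point's reference-frame location.
            flow[t, 0, yi, xi] = traj[t, 0] - x0
            flow[t, 1, yi, xi] = traj[t, 1] - y0
            mask[t, 0, yi, xi] = 1.0
    return flow, mask


# Example: one point dragged 40 px to the right over 16 frames.
traj = np.stack([np.array([64.0 + 40.0 * t / 15, 64.0]) for t in range(16)])
flow, mask = encode_sparse_trajectories([traj], num_frames=16, height=128, width=128)
```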