IMAGE CONDUCTOR: PRECISION CONTROL FOR INTERACTIVE VIDEO SYNTHESIS

21 Jun 2024 | Yaowei Li¹,², Xintao Wang², Zhaoyang Zhang²¹, Zhouxia Wang²,³, Ziyang Yuan²,⁴, Liangbin Xie²,⁵,⁶, Yuexian Zou¹, Ying Shan²
Image Conductor is a method for precise control of camera transitions and object movements when generating video assets from a single image. Its training strategy separates camera motion from object motion by using distinct camera LoRA and object LoRA weights. At inference time it introduces a camera-free guidance technique that enhances object movements while eliminating camera transitions. A trajectory-oriented video motion data curation pipeline is also developed to produce the training data. Quantitative and qualitative experiments demonstrate the method's precision and fine-grained control in generating motion-controllable videos from images, advancing the practical application of interactive video synthesis.
The method addresses the challenge that camera transitions and object movements are usually mixed together in real-world video data. It uses a video ControlNet trained on annotated data to convey motion information to the UNet backbone of the diffusion model. A collaborative optimization method applies distinct LoRA weights to distinguish the different types of motion, and an orthogonal loss is introduced to keep those LoRA weights independent of each other, enabling accurate motion disentanglement. To flexibly eliminate cinematographic variations caused by ill-posed trajectories, a new camera-free guidance technique is introduced. Similar to classifier-free guidance, this technique iteratively performs an extrapolation fusion between different latents during the sampling process of the diffusion model.
The main contributions of the paper are: (1) constructing a high-quality video motion dataset with precise trajectory annotations, (2) introducing a method to collaboratively optimize LoRA weights in the motion ControlNet, (3) proposing camera-free guidance to heuristically eliminate camera transitions caused by multiple trajectories, and (4) extensive experiments demonstrating the superiority of the method in precise, fine-grained motion control.
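To make the idea of an orthogonal loss concrete, the following is a minimal sketch of one common way to penalize overlap between two weight matrices: the squared Frobenius norm of their cross-product. This summary does not give the paper's exact formulation, so the function names `cross_gram` and `orthogonality_penalty`, the choice of penalty, and the toy matrices are all assumptions made for illustration. A real implementation would use a tensor library rather than nested lists.

```python
# Illustrative sketch only: the penalty form and all names are assumptions,
# not the formulation used by Image Conductor.

def cross_gram(a, b):
    """Return A^T B for two matrices given as lists of rows (same row count)."""
    if len(a) != len(b):
        raise ValueError("matrices must have the same number of rows")
    cols_a = list(zip(*a))  # columns of A
    cols_b = list(zip(*b))  # columns of B
    return [[sum(x * y for x, y in zip(ca, cb)) for cb in cols_b] for ca in cols_a]


def orthogonality_penalty(a, b):
    """Squared Frobenius norm of A^T B.

    The value is zero exactly when every column of A is orthogonal to
    every column of B, and grows as the two column spaces overlap.
    """
    return sum(v * v for row in cross_gram(a, b) for v in row)


# Hypothetical stand-ins for a camera LoRA matrix and an object LoRA matrix.
camera_weights = [[1.0, 0.0], [0.0, 0.0]]
object_weights = [[0.0, 0.0], [1.0, 0.0]]

disjoint = orthogonality_penalty(camera_weights, object_weights)
overlapping = orthogonality_penalty(camera_weights, camera_weights)
```

Adding a term of this kind to the training objective pushes the two sets of weights toward non-overlapping directions, which is the intuition behind using separate LoRA weights for camera and object motion.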
The method is evaluated against state-of-the-art methods, showing superior performance in terms of video content quality and motion control. It also demonstrates the ability to generate controllable video content assets, enabling personalized and controllable video synthesis. Ablation studies show that distinct LoRA weights enable precise separation and control of camera transitions and object movements. The camera-free guidance technique effectively enhances object movements while eliminating camera transitions.
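The camera-free guidance described above is said to resemble classifier-free guidance, which combines two predictions by extrapolating from one toward the other. The sketch below shows that extrapolation step inside a toy sampling loop. Which two predictions are fused, the guidance scale, and the update rule are assumptions here: `predict_camera` and `predict_object` are hypothetical callables, and the subtraction step is a placeholder rather than a real diffusion scheduler.

```python
# Illustrative sketch only: the predictors, the scale, and the update rule
# are hypothetical and do not reproduce the paper's sampler.

def extrapolate(base, target, scale):
    """Element-wise base + scale * (target - base).

    scale = 0 returns base, scale = 1 returns target, and scale > 1
    pushes past target, away from base (the classifier-free-guidance form).
    """
    if len(base) != len(target):
        raise ValueError("inputs must have the same length")
    return [b + scale * (t - b) for b, t in zip(base, target)]


def guided_sampling(latent, predict_camera, predict_object, scale, steps, step_size):
    """Repeat the extrapolation at every step of a toy sampling loop."""
    for _ in range(steps):
        eps_camera = predict_camera(latent)  # prediction carrying camera motion
        eps_object = predict_object(latent)  # prediction carrying object motion
        eps = extrapolate(eps_camera, eps_object, scale)
        # Placeholder update; a real sampler would call its scheduler here.
        latent = [z - step_size * e for z, e in zip(latent, eps)]
    return latent


fused = extrapolate([0.0, 1.0], [1.0, 1.0], 2.0)
result = guided_sampling(
    [0.0],
    predict_camera=lambda z: [1.0],
    predict_object=lambda z: [0.0],
    scale=2.0,
    steps=2,
    step_size=0.5,
)
```

With a scale above 1, the fused prediction moves further from the camera-motion prediction than the object-motion prediction alone would, which matches the stated goal of strengthening object movement while suppressing camera transitions.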