22 Feb 2024 | Yixuan Ren, Yang Zhou, Jimei Yang, Jing Shi, Difan Liu, Feng Liu, Mingi Kwon, Abhinav Shrivastava
Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models
This paper proposes Customize-A-Video, a method for one-shot motion customization of text-to-video diffusion models. The method models motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal variations. It applies low-rank adaptation (LoRA) to the temporal attention layers of a pre-trained T2V diffusion model, tailoring the model to the specific motion in the reference video. To disentangle spatial and temporal information during training, the paper introduces appearance absorbers, which detach the original appearance from the single reference video before motion learning.
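Concretely, LoRA on a temporal attention layer amounts to wrapping its frozen linear projections with a small trainable low-rank residual. The sketch below is a minimal PyTorch illustration, not the authors' implementation; the rank, scaling, and the name filter used to locate temporal attention projections inside the T2V UNet are assumptions.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear projection plus a trainable low-rank residual:
    y = W x + (alpha / r) * B(A(x)). Only A and B receive gradients."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # keep pre-trained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B
        nn.init.kaiming_uniform_(self.down.weight)
        nn.init.zeros_(self.up.weight)           # residual starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))


def inject_temporal_lora(unet: nn.Module, rank: int = 4):
    """Wrap the linear projections of temporal attention blocks with LoRA,
    leaving all spatial layers untouched. The "temp_attn"/"temporal" name
    filter is a guess at how a given T2V UNet labels these modules."""
    trainable = []
    for name, module in list(unet.named_modules()):
        if isinstance(module, nn.Linear) and ("temp_attn" in name or "temporal" in name):
            parent_name, _, child_name = name.rpartition(".")
            parent = unet.get_submodule(parent_name) if parent_name else unet
            lora = LoRALinear(module, rank=rank)
            setattr(parent, child_name, lora)
            trainable += [lora.down.weight, lora.up.weight]
    return trainable                             # parameters to optimize
```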
The approach extends readily to downstream tasks including custom video generation and editing, video appearance customization, and multiple motion combination. A Temporal LoRA (T-LoRA) module learns the specific motion from a reference video, enabling motion transfer that is both accurate and varied, while an Appearance Absorber module decomposes the spatial information and excludes it from the motion customization process. Both modules are plug-and-play and can be combined for multiple downstream applications.
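As a rough illustration of how the T-LoRA stage could be trained once an appearance absorber has been fitted and frozen, the following sketch optimizes only the temporal LoRA parameters with a standard noise-prediction loss, assuming a diffusers-style 3D UNet and scheduler interface; the function names, variable names, and hyperparameters are placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def train_temporal_lora(unet, scheduler, video_latents, prompt_embeds,
                        steps: int = 500, lr: float = 1e-4):
    """Hedged sketch of the motion-learning stage: the appearance absorber is
    assumed to be already trained and frozen inside `unet`, so fitting the
    reference clip mostly requires the temporal LoRA to model its motion."""
    params = inject_temporal_lora(unet)              # from the sketch above
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(steps):
        noise = torch.randn_like(video_latents)
        t = torch.randint(0, scheduler.config.num_train_timesteps,
                          (video_latents.shape[0],), device=video_latents.device)
        noisy = scheduler.add_noise(video_latents, noise, t)
        # standard epsilon-prediction objective on the single reference clip
        pred = unet(noisy, t, encoder_hidden_states=prompt_embeds).sample
        loss = F.mse_loss(pred, noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # At inference the absorber is dropped and only the temporal LoRA is kept,
    # so new prompts supply appearance while the LoRA supplies motion.
    return params
```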
The method is evaluated on multiple datasets and compared with existing methods. The results show that it generates videos that are more faithful to the reference motion and more diverse than those of both per-frame video editing approaches and the base T2V model, while supporting downstream tasks including precise video editing, video appearance customization, and multiple motion combination.