22 Feb 2024 | Yixuan Ren, Yang Zhou, Jimei Yang, Jing Shi, Difan Liu, Feng Liu, Mingi Kwon, Abhinav Shrivastava
**Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models**
**Authors:** Yixuan Ren, Yang Zhou, Jimei Yang, Jing Shi, Difan Liu, Feng Liu, Mingi Kwon, Abhinav Shrivastava
**Abstract:**
Image customization has been extensively studied in text-to-image (T2I) diffusion models, but motion customization in text-to-video (T2V) models remains underexplored. To address this, we propose *Customize-A-Video*, a method that learns motion from a single reference video and adapts it to new subjects and scenes with spatial and temporal variations. We leverage Low-Rank Adaptation (LoRA) on temporal attention layers to tailor the pre-trained T2V diffusion model for specific motion modeling. Our method includes a three-stage training and inference pipeline: first, we train an *Appearance Absorber* to capture spatial information from the reference video; second, we train a Temporal LoRA (T-LoRA) on the temporal attention layers of the T2V model; finally, during inference, we remove the Appearance Absorber and load only the trained T-LoRA. This approach enables accurate and diverse motion customization, enhancing the dynamism and engagement of generated videos.
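The staging of these three steps can be illustrated with a minimal, self-contained PyTorch sketch. Everything here (`LoRALinear`, `ToyT2VBlock`, the toy tensor shapes) is an illustrative placeholder rather than the authors' code, and the actual training on the reference video with a diffusion loss is omitted; the sketch only shows which adapters are trained or kept at each stage.

```python
# Hypothetical sketch of the three-stage pipeline (not the authors' implementation).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank residual (LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)       # pre-trained weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)    # adapter starts as a no-op
        self.scale = alpha / rank
        self.enabled = True

    def forward(self, x):
        out = self.base(x)
        if self.enabled:
            out = out + self.scale * self.up(self.down(x))
        return out

class ToyT2VBlock(nn.Module):
    """Stand-in for one T2V UNet block with a spatial and a temporal projection."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.spatial_proj = nn.Linear(dim, dim)
        self.temporal_proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.temporal_proj(self.spatial_proj(x))

block = ToyT2VBlock()

# Stage 1: train an Appearance Absorber (here: a spatial LoRA) on the reference video.
block.spatial_proj = LoRALinear(block.spatial_proj)    # S-LoRA absorbs appearance

# Stage 2: with the absorber in place, train a Temporal LoRA on the temporal layers.
block.temporal_proj = LoRALinear(block.temporal_proj)  # T-LoRA captures the motion

# Stage 3 (inference): discard the Appearance Absorber, keep only the trained T-LoRA.
block.spatial_proj.enabled = False

x = torch.randn(2, 16, 32)  # (batch, frames, channels) for the toy block
print(block(x).shape)       # torch.Size([2, 16, 32])
```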
**Contributions:**
- We present a novel one-shot motion customization method that learns from a single reference video, built on pre-trained T2V diffusion models.
- We introduce Temporal LoRA (T-LoRA) to learn specific motion from a reference video, facilitating motion transfer with accuracy and variety.
- We propose an *Appearance Absorber* module to effectively disentangle spatial (appearance) information from motion.
- Our modules are plug-and-play and can be extended to multiple downstream applications.
**Methods:**
- **Text-to-video diffusion models:** Train a 3D UNet to generate videos conditioned on text prompts.
- **Low-Rank Adaptation (LoRA):** Apply LoRA to adapt pre-trained models to downstream tasks, focusing on spatial and temporal attention layers (the standard LoRA update is shown after this list).
- **Temporal LoRA (T-LoRA):** Learn motion characteristics from input videos and enable motion customization for new appearances via text prompts.
- **Appearance Absorbers:** Separate spatial signals from temporal signals within a single video, using methods like Spatial LoRA (S-LoRA) and Textual Inversion.
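For reference, the standard LoRA parameterization of a frozen projection weight $W_0$ is given below; the rank $r$ and scaling $\alpha$ are generic hyperparameters, and the specific values used in the paper are not assumed here.

$$
h = W_0 x + \frac{\alpha}{r}\, B A x, \qquad A \in \mathbb{R}^{r \times d_{\text{in}}},\quad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad r \ll \min(d_{\text{in}}, d_{\text{out}}),
$$

with $W_0$ frozen, $A$ and $B$ trainable, and $B$ initialized to zero so training starts from the pre-trained behavior. T-LoRA attaches such adapters to the temporal attention projections to capture motion, while S-LoRA, used as an Appearance Absorber, attaches them to spatial layers.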
**Experiments:**
- **Qualitative Results:** Compare with existing methods, demonstrating the ability to transfer reference motion to new scenarios and subjects with temporal variations.
- **Quantitative Results:** Measure performance with metrics such as CLIPScore, LPIPS, and diversity among generated videos (a CLIP-based scoring sketch follows this list).
- **Ablation Studies:** Evaluate the impact of applying LoRA to different attention layers and compare various Appearance Absorbers.
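As a concrete illustration of the text-alignment metric, the sketch below computes a frame-averaged CLIP text-image similarity, the quantity CLIPScore-style metrics are built on. It is a minimal sketch assuming the Hugging Face `transformers` CLIP implementation and the public `openai/clip-vit-base-patch32` checkpoint; the paper's exact evaluation protocol may differ.

```python
# Hedged sketch: frame-averaged CLIP text-image similarity for a generated video.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_frame_score(frames, prompt: str) -> float:
    """Average cosine similarity between the prompt and each video frame."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

# Example with dummy frames; replace with decoded frames of a generated video.
frames = [Image.new("RGB", (224, 224)) for _ in range(4)]
print(clip_frame_score(frames, "a panda riding a bicycle"))
```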
**Applications:**
- **Video Appearance Customization:** Combine motion and appearance customization.
- **Multiple Motion Combination:** Integrate multiple learned motions into one output video.