22 Feb 2024 | Yixuan Ren, Yang Zhou, Jimei Yang, Jing Shi, Difan Liu, Feng Liu, Mingi Kwon, Abhinav Shrivastava
**Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models**
**Authors:** Yixuan Ren, Yang Zhou, Jimei Yang, Jing Shi, Difan Liu, Feng Liu, Mingi Kwon, Abhinav Shrivastava
**Abstract:**
Image customization has been extensively studied in text-to-image (T2I) diffusion models, but motion customization in text-to-video (T2V) models remains underexplored. To address this, we propose *Customize-A-Video*, a method that learns motion from a single reference video and adapts it to new subjects and scenes with spatial and temporal variations. We leverage Low-Rank Adaptation (LoRA) on temporal attention layers to tailor the pre-trained T2V diffusion model for specific motion modeling. Our method includes a three-stage training and inference pipeline: first, we train an *Appearance Absorber* to capture spatial information from the reference video; second, we train a Temporal LoRA (T-LoRA) on the temporal attention layers of the T2V model; finally, during inference, we remove the Appearance Absorber and load only the trained T-LoRA. This approach enables accurate and diverse motion customization, enhancing the dynamism and engagement of generated videos.
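The staging of these three steps can be illustrated with a minimal, self-contained PyTorch sketch. Everything here (`LoRALinear`, `ToyT2VBlock`, the toy tensor shapes) is an illustrative placeholder rather than the authors' code, and the actual training on the reference video with a diffusion loss is omitted; the sketch only shows which adapters are trained or kept at each stage.

```python
# Hypothetical sketch of the three-stage pipeline (not the authors' implementation).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank residual (LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)       # pre-trained weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)    # adapter starts as a no-op
        self.scale = alpha / rank
        self.enabled = True

    def forward(self, x):
        out = self.base(x)
        if self.enabled:
            out = out + self.scale * self.up(self.down(x))
        return out

class ToyT2VBlock(nn.Module):
    """Stand-in for one T2V UNet block with a spatial and a temporal projection."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.spatial_proj = nn.Linear(dim, dim)
        self.temporal_proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.temporal_proj(self.spatial_proj(x))

block = ToyT2VBlock()

# Stage 1: train an Appearance Absorber (here: a spatial LoRA) on the reference video.
block.spatial_proj = LoRALinear(block.spatial_proj)    # S-LoRA absorbs appearance

# Stage 2: with the absorber in place, train a Temporal LoRA on the temporal layers.
block.temporal_proj = LoRALinear(block.temporal_proj)  # T-LoRA captures the motion

# Stage 3 (inference): discard the Appearance Absorber, keep only the trained T-LoRA.
block.spatial_proj.enabled = False

x = torch.randn(2, 16, 32)  # (batch, frames, channels) for the toy block
print(block(x).shape)       # torch.Size([2, 16, 32])
```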
**Contributions:**
- We present a novel one-shot motion customization method that learns from a single reference video, built on pre-trained T2V diffusion models.
- We introduce Temporal LoRA (T-LoRA) to learn specific motion from a reference video, facilitating motion transfer with accuracy and variety.
- We propose an *Appearance Absorber* module to effectively disentangle spatial (appearance) information from motion.
- Our modules are plug-and-play and can be extended to multiple downstream applications.
**Methods:**
- **Text-to-video diffusion models:** Train a 3D UNet to generate videos conditioned on text prompts.
- **Low-Rank Adaptation (LoRA):** Apply LoRA to adapt pre-trained models to downstream tasks, focusing on spatial and temporal attention layers (the standard LoRA update is shown after this list).
- **Temporal LoRA (T-LoRA):** Learn motion characteristics from input videos and enable motion customization for new appearances via text prompts.
- **Appearance Absorbers:** Separate spatial signals from temporal signals within a single video, using methods like Spatial LoRA (S-LoRA) and Textual Inversion.
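For reference, the standard LoRA parameterization of a frozen projection weight $W_0$ is given below; the rank $r$ and scaling $\alpha$ are generic hyperparameters, and the specific values used in the paper are not assumed here.

$$
h = W_0 x + \frac{\alpha}{r}\, B A x, \qquad A \in \mathbb{R}^{r \times d_{\text{in}}},\quad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad r \ll \min(d_{\text{in}}, d_{\text{out}}),
$$

with $W_0$ frozen, $A$ and $B$ trainable, and $B$ initialized to zero so training starts from the pre-trained behavior. T-LoRA attaches such adapters to the temporal attention projections to capture motion, while S-LoRA, used as an Appearance Absorber, attaches them to spatial layers.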
**Experiments:**
- **Qualitative Results:** Compare with existing methods, demonstrating the ability to transfer reference motion to new scenarios and subjects with temporal variations.
- **Quantitative Results:** Measure performance with metrics such as CLIPScore, LPIPS, and diversity among generated videos (a CLIP-based scoring sketch follows this list).
- **Ablation Studies:** Evaluate the impact of applying LoRA to different attention layers and compare various Appearance Absorbers.
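As a concrete illustration of the text-alignment metric, the sketch below computes a frame-averaged CLIP text-image similarity, the quantity CLIPScore-style metrics are built on. It is a minimal sketch assuming the Hugging Face `transformers` CLIP implementation and the public `openai/clip-vit-base-patch32` checkpoint; the paper's exact evaluation protocol may differ.

```python
# Hedged sketch: frame-averaged CLIP text-image similarity for a generated video.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_frame_score(frames, prompt: str) -> float:
    """Average cosine similarity between the prompt and each video frame."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

# Example with dummy frames; replace with decoded frames of a generated video.
frames = [Image.new("RGB", (224, 224)) for _ in range(4)]
print(clip_frame_score(frames, "a panda riding a bicycle"))
```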
**Applications:**
- **Video Appearance Customization:** Combine motion and appearance customization.
- **Multiple Motion Combination:** Integrate multiple learned motions into one output video.