Magic-Me: Identity-Specific Video Customized Diffusion

20 Mar 2024 | Ze Ma, Daquan Zhou, Xue-She Wang, Chun-Hsiao Yeh, Xiuyu Li, Huanrui Yang, Zhen Dong, Kurt Keutzer, and Jiashi Feng
The paper introduces Video Custom Diffusion (VCD), a novel framework for generating high-quality, identity-specific videos. VCD addresses the challenge of controlling the identity of subjects in video generation, particularly for humans, by encoding identity information and maintaining frame-wise correlation. The framework consists of three stages, each of which sharpens identity characteristics and improves video quality:

1. **T2V VCD**: Initializes the video with a 3D Gaussian Noise Prior to ensure temporal consistency and stability.
2. **Face VCD**: Enhances facial details by cropping, upsampling, and regenerating faces while preserving identity features.
3. **Tiled VCD**: Upscales the video to higher resolution while maintaining identity consistency across frames.

The paper also introduces an ID module based on extended Textual Inversion (TI) to disentangle identity information from the background, and a training-free 3D Gaussian Noise Prior to improve motion consistency. Extensive experiments demonstrate that VCD outperforms existing methods in generating stable, high-quality videos with accurate identity preservation. The framework is flexible and can be integrated with various text-to-image models, making it suitable for a wide range of applications.
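The summary describes the training-free 3D Gaussian Noise Prior only at a high level. One common way to realize a correlated-noise initialization of this kind is to mix a shared base noise tensor with per-frame independent noise, so that every frame starts denoising from a latent that is correlated with its neighbors while each frame's noise remains unit-variance Gaussian. The sketch below is an illustrative assumption, not the paper's exact formulation; the mixing weight `alpha` and the tensor shapes are hypothetical.

```python
import numpy as np

def correlated_noise_prior(num_frames, shape, alpha=0.3, seed=0):
    """Training-free correlated noise across frames (illustrative sketch).

    Each frame's initial latent mixes a shared base noise with
    independent per-frame noise, giving covariance `alpha` between any
    two frames while keeping each frame unit-variance Gaussian.
    """
    rng = np.random.default_rng(seed)
    base = rng.standard_normal(shape)       # shared across all frames
    frames = []
    for _ in range(num_frames):
        indep = rng.standard_normal(shape)  # per-frame component
        frames.append(np.sqrt(alpha) * base + np.sqrt(1.0 - alpha) * indep)
    return np.stack(frames)                 # (num_frames, *shape)

noise = correlated_noise_prior(num_frames=16, shape=(4, 64, 64))
```

The variance-preserving mix (`sqrt(alpha)` / `sqrt(1 - alpha)`) matters: the diffusion sampler expects standard-normal initial latents, so the prior should add inter-frame correlation without changing the marginal distribution of any single frame.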
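The ID module's disentanglement of identity from background can be understood through a masked training objective: if the diffusion reconstruction loss is restricted to the subject's segmentation mask, background pixels never contribute gradient to the learned identity tokens. The snippet below is a minimal numpy sketch of such a masked loss; the function name, mask convention, and the idea of applying it to extended-TI training here are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def masked_diffusion_loss(noise_pred, noise_target, subject_mask):
    """Reconstruction loss restricted to the subject region (sketch).

    subject_mask is 1 inside the subject's segmentation mask and 0 in
    the background, so gradients through this loss are driven only by
    the subject and stay disentangled from the background.
    """
    sq_err = (noise_pred - noise_target) ** 2
    return (subject_mask * sq_err).sum() / max(subject_mask.sum(), 1.0)

# Toy example: the loss ignores arbitrarily large background error.
pred = np.zeros((8, 8))
target = np.zeros((8, 8))
target[:, 4:] = 100.0   # huge error, but only in the background
mask = np.zeros((8, 8))
mask[:, :4] = 1.0       # subject occupies the left half
print(masked_diffusion_loss(pred, target, mask))  # → 0.0
```

Normalizing by the mask area rather than the full image keeps the loss scale comparable across subjects of different sizes.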