20 Mar 2024 | Ze Ma¹, Daquan Zhou¹, Xue-She Wang¹, Chun-Hsiao Yeh², Xiuyu Li², Huanrui Yang², Zhen Dong², Kurt Keutzer², and Jiashi Feng¹
The paper introduces Video Custom Diffusion (VCD), a framework for generating identity-specific videos with high quality and stability. VCD is designed to preserve a subject's identity characteristics across frames while ensuring temporal consistency. To improve inter-frame stability, the framework incorporates a 3D Gaussian Noise Prior; this prior is training-free and reconstructs the correlation across frames, leading to more stable video outputs. Additionally, an ID module is proposed to disentangle identity information from the background, enabling better alignment with user prompts. The ID module is extended from a single text token to multiple tokens so that it encodes identity information more accurately, improving the quality of the learned identity.
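A training-free correlated-noise prior of this kind can be realized in a few lines. The sketch below mixes one noise map shared by all frames with independent per-frame noise; the mixing coefficient and the (batch, channels, frames, height, width) latent layout are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def gaussian_noise_prior(batch, channels, frames, height, width, mixing=0.2):
    """Sample denoising noise that is correlated along the frame axis.

    A fraction `mixing` of the variance comes from a single noise map
    shared by all frames; the rest comes from independent per-frame noise.
    Every element keeps unit variance while frames are positively
    correlated. `mixing` is a hypothetical hyperparameter.
    """
    shared = torch.randn(batch, channels, 1, height, width)
    independent = torch.randn(batch, channels, frames, height, width)
    # Var = mixing + (1 - mixing) = 1; covariance between any two frames = mixing.
    return mixing ** 0.5 * shared + (1.0 - mixing) ** 0.5 * independent

noise = gaussian_noise_prior(1, 4, 16, 64, 64)  # latents for a 16-frame clip
print(noise.shape)  # torch.Size([1, 4, 16, 64, 64])
```

Because the shared component broadcasts across the frame axis, sampling cost is essentially the same as for ordinary i.i.d. Gaussian noise.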
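The multi-token ID module follows the extended textual-inversion pattern: register several new placeholder tokens for one identity and train only their embedding rows. Below is a minimal sketch using Hugging Face transformers, assuming a CLIP text encoder and an illustrative token count of three; the training loop itself (denoising loss on identity images, with the background masked out per the disentanglement goal above) is omitted.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register several placeholder tokens for one identity (count is illustrative).
id_tokens = [f"<id_{k}>" for k in range(3)]
tokenizer.add_tokens(id_tokens)
text_encoder.resize_token_embeddings(len(tokenizer))

# Freeze the encoder; only the new embedding rows should receive updates.
for p in text_encoder.parameters():
    p.requires_grad_(False)
embeddings = text_encoder.get_input_embeddings()
embeddings.weight.requires_grad_(True)

# Mask gradients so pre-existing token embeddings stay untouched.
new_ids = tokenizer.convert_tokens_to_ids(id_tokens)
mask = torch.zeros_like(embeddings.weight)
mask[new_ids] = 1.0
embeddings.weight.register_hook(lambda grad: grad * mask)

# Prompts then reference the identity as, e.g., "<id_0> <id_1> <id_2> riding a bike".
```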
The VCD framework consists of three stages: T2V VCD, Face VCD, and Tiled VCD. T2V VCD generates an initial low-resolution video that already incorporates the identity characteristics. Face VCD enhances facial detail by cropping each face, upsampling it, and regenerating it with more identity-specific detail. Tiled VCD then upscales the video to a higher resolution while preserving the identity's features. The framework is compatible with various text-to-image models and can be adapted to different conditional inputs such as poses, depths, and emotions.
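As a rough illustration of how the three stages compose, here is a sketch with the model calls injected as callables; `txt2vid`, `img2img`, `detect_faces`, the resolutions, and the denoising strengths are all placeholder assumptions for this sketch, not interfaces from the paper.

```python
from typing import Callable, List
from PIL import Image

def run_vcd(prompt: str,
            txt2vid: Callable[[str], List[Image.Image]],
            img2img: Callable[[Image.Image, str, float], Image.Image],
            detect_faces: Callable[[Image.Image], List[tuple]]) -> List[Image.Image]:
    # Stage 1: T2V VCD -- a low-resolution clip whose prompt contains the
    # identity tokens, so every frame already carries the identity.
    frames = txt2vid(prompt)

    # Stage 2: Face VCD -- crop each detected face, upsample it, and partially
    # re-denoise it with the same ID-conditioned model to restore facial detail.
    for frame in frames:
        for box in detect_faces(frame):  # box = (left, top, right, bottom)
            face = frame.crop(box).resize((512, 512), Image.LANCZOS)
            face = img2img(face, prompt, 0.5)  # moderate denoising strength
            face = face.resize((box[2] - box[0], box[3] - box[1]))
            frame.paste(face, box[:2])

    # Stage 3: Tiled VCD -- upscale each frame with the same model; real
    # tiling (overlapping patches, blending) is omitted for brevity.
    return [img2img(f.resize((f.width * 2, f.height * 2)), prompt, 0.3)
            for f in frames]
```

The key design point is that all three stages condition on the same ID module, so each round of enhancement reinforces rather than drifts from the learned identity.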
Experiments show that VCD outperforms existing methods at generating stable videos with strong identity preservation. The framework is evaluated with quantitative metrics for identity alignment, text alignment, and temporal smoothness, and the results demonstrate that VCD strikes a better balance between identity preservation and text alignment. VCD is also shown to generate videos with controllable emotions and motions, making it suitable for the film industry and other domains requiring identity-specific video generation. Overall, the method offers a modular approach to identity-specific video generation, improving flexibility and practicality for content creation in AI-generated content communities.
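For concreteness, the three metrics can be approximated with CLIP embeddings as sketched below; the paper may use different backbones (e.g., a face-recognition model for identity alignment), so treat this as an illustrative stand-in rather than the paper's exact evaluation code.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_metrics(frames, prompt, reference):
    """frames: list of PIL video frames; reference: a PIL image of the identity."""
    inputs = processor(text=[prompt], images=frames + [reference],
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    frame_emb, ref_emb = img[:-1], img[-1]
    return {
        # Mean frame-to-prompt cosine similarity.
        "text_alignment": (frame_emb @ txt.T).mean().item(),
        # Mean frame-to-reference cosine similarity.
        "identity_alignment": (frame_emb @ ref_emb).mean().item(),
        # Mean cosine similarity between consecutive frames.
        "temporal_smoothness": (frame_emb[:-1] * frame_emb[1:]).sum(-1).mean().item(),
    }
```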