CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

22 May 2024 | Zhao Wang, Aoxue Li, Lingting Zhu, Yong Guo, Qi Dou, Zhenguo Li
CustomVideo is a novel framework for multi-subject text-to-video generation, designed to produce high-quality videos that preserve the identity of multiple subjects while maintaining smooth motion. The framework uses a simple yet effective co-occurrence and attention control mechanism to disentangle different subjects: it introduces a learnable word token for each subject, enabling the model to focus on the corresponding subject region during training, and it incorporates a ground-truth object mask, obtained through segmentation, to guide attention learning.

A comprehensive benchmark dataset, CustomStudio, is collected for evaluation, containing 63 individual subjects from 13 categories and 68 meaningful subject pairs. Extensive experiments show that CustomVideo outperforms previous state-of-the-art methods on CLIP Image Alignment, DINO Image Alignment, and Temporal Consistency. The framework is efficient: it requires only subject images for training and can generate high-quality videos of those subjects from a text prompt. It also performs well in challenging scenarios involving visually similar objects, is implemented on top of Diffusers, and can generate high-resolution videos without additional training cost.

The method is evaluated through qualitative results, quantitative metrics, and user studies, demonstrating its effectiveness at generating videos with customized subjects. Failure cases reveal limitations when handling too many subjects or very small faces. Overall, CustomVideo provides a strong baseline for subject-driven applications, particularly in multi-subject scenarios.
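To make the attention-control idea concrete, the sketch below illustrates one plausible form of mask-guided attention supervision: a subject token's cross-attention map is penalized for falling outside that subject's segmentation mask. The function name, loss form, and toy shapes are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def attention_mask_loss(attn_map, subject_mask, eps=1e-8):
    """Hypothetical mask-guided attention loss (illustrative only).

    attn_map:     (H, W) non-negative cross-attention weights for one
                  subject's learnable word token.
    subject_mask: (H, W) binary segmentation mask, 1 inside the subject.

    Returns a scalar in [0, 1]: 0 when all attention mass lies inside
    the mask, approaching 1 when it lies entirely outside.
    """
    total = attn_map.sum() + eps                  # total attention mass
    inside = (attn_map * subject_mask).sum()      # mass inside the mask
    return float(1.0 - inside / total)

# Toy example on a 4x4 latent grid with two candidate masks.
attn = np.zeros((4, 4))
attn[:2, :2] = 1.0            # token attends to the top-left quadrant

mask_right = np.zeros((4, 4))
mask_right[:2, :2] = 1.0      # subject occupies that same quadrant
low_loss = attention_mask_loss(attn, mask_right)

mask_wrong = np.zeros((4, 4))
mask_wrong[2:, 2:] = 1.0      # subject actually lives elsewhere
high_loss = attention_mask_loss(attn, mask_wrong)
```

Minimizing a loss like this for each subject token separately would push the tokens toward disjoint spatial regions, which is one way to achieve the disentanglement of visually similar subjects that the summary describes.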