**CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects**
This paper introduces CustomVideo, a framework for generating high-quality videos guided by text prompts and subject reference images, with a focus on multi-subject customization. The key challenge it addresses is making multiple subjects co-occur in the generated video while each preserves its identity. CustomVideo employs a simple yet effective co-occurrence and attention control mechanism: during training, multiple subjects are composed into a single image to encourage their simultaneous presence, and an attention control strategy disentangles the subjects in the diffusion model's latent space, using ground-truth object masks to steer the model's attention toward each subject's region.
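The summary does not spell out the exact training objective, but one plausible form of such mask-guided attention control is an auxiliary loss that penalizes attention mass falling outside each subject's mask. The following PyTorch sketch illustrates the idea; `attention_mask_loss` and its tensor shapes are hypothetical names for this illustration, not the paper's API:

```python
import torch
import torch.nn.functional as F

def attention_mask_loss(attn_maps: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Penalize cross-attention that falls outside each subject's mask.

    attn_maps: (S, H, W) cross-attention maps, one per subject token.
    masks:     (S, H, W) binary ground-truth masks (1 = subject region).
    """
    # Downsample masks to the attention resolution if they differ.
    if masks.shape[-2:] != attn_maps.shape[-2:]:
        masks = F.interpolate(masks.unsqueeze(1).float(),
                              size=attn_maps.shape[-2:],
                              mode="nearest").squeeze(1)
    # Normalize each subject's map into a spatial probability distribution.
    probs = attn_maps.flatten(1).softmax(dim=-1).view_as(attn_maps)
    # Sum the probability mass that lands outside the subject's mask.
    outside = (probs * (1.0 - masks)).flatten(1).sum(dim=-1)
    return outside.mean()
```

Minimizing such a term drives each subject token's attention toward its own region, which is one way to realize the disentanglement the authors describe.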
Extensive experiments on CustomStudio, a comprehensive benchmark dataset, show that CustomVideo outperforms previous state-of-the-art methods in qualitative comparisons, quantitative metrics, and user studies, improving CLIP Image Alignment and DINO Image Alignment by 11.99% and 23.39%, respectively. CustomVideo also handles challenging cases involving visually similar subjects and generates videos with rich motion. The paper includes detailed experimental setups, evaluation metrics, and ablation studies that validate the effectiveness of the proposed approach.
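For reference, CLIP Image Alignment is commonly computed as the mean cosine similarity between CLIP embeddings of the generated frames and the reference subject image (DINO Image Alignment is analogous, with DINO features). A minimal sketch using the Hugging Face `transformers` CLIP model, assuming this standard protocol rather than the paper's exact evaluation code:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_image_alignment(frames: list[Image.Image], reference: Image.Image) -> float:
    """Mean cosine similarity between each generated frame and the reference image."""
    inputs = processor(images=frames + [reference], return_tensors="pt")
    feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize embeddings
    frame_feats, ref_feat = feats[:-1], feats[-1]
    return (frame_feats @ ref_feat).mean().item()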