MS-Diffusion is a framework for zero-shot, multi-subject image personalization with layout guidance. It addresses two key challenges: maintaining detailed subject fidelity and achieving a cohesive representation when multiple subjects appear in one image. The framework integrates grounding tokens with a feature resampler to preserve detail fidelity across subjects, and employs a multi-subject cross-attention mechanism that adapts to multi-subject inputs by restricting each subject condition to its designated area. This design strengthens control over the image's multi-subject composition while preserving text control. Comprehensive experiments show that MS-Diffusion outperforms existing models in both image and text fidelity, marking a notable advance in personalized text-to-image generation. The method is evaluated on benchmarks including DreamBench and MS-Bench and demonstrates superior performance in both single-subject and multi-subject personalization tasks.
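
The layout-guided cross-attention idea can be illustrated with a minimal PyTorch sketch. This is not the released MS-Diffusion implementation; the module name `MultiSubjectCrossAttention` and arguments such as `subject_tokens` and `boxes` are illustrative assumptions. The sketch only shows the core mechanism: each query location in the U-Net feature map may attend only to the image tokens of the subject whose bounding box covers it.

```python
# Minimal sketch (assumptions, not the authors' code) of layout-masked
# multi-subject cross-attention: subject image tokens are visible only
# to query positions inside that subject's bounding box.
import torch
from torch import nn


class MultiSubjectCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, latents, subject_tokens, boxes, h, w):
        # latents:        (B, h*w, dim)  spatial U-Net features
        # subject_tokens: (B, S, T, dim) resampled image tokens per subject
        # boxes:          (B, S, 4)      normalized (x0, y0, x1, y1) layouts
        B, S, T, dim = subject_tokens.shape
        q = self.to_q(latents)
        k = self.to_k(subject_tokens.reshape(B, S * T, dim))
        v = self.to_v(subject_tokens.reshape(B, S * T, dim))

        # Build a boolean mask: query positions outside a subject's box
        # cannot attend to that subject's tokens.
        ys = torch.linspace(0, 1, h, device=latents.device)
        xs = torch.linspace(0, 1, w, device=latents.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        gx = gx.reshape(1, 1, h * w)
        gy = gy.reshape(1, 1, h * w)
        x0, y0, x1, y1 = boxes.unbind(-1)                  # each (B, S)
        inside = (
            (gx >= x0[..., None]) & (gx <= x1[..., None]) &
            (gy >= y0[..., None]) & (gy <= y1[..., None])
        )                                                   # (B, S, h*w)
        attn_mask = inside.permute(0, 2, 1)                 # (B, h*w, S)
        attn_mask = attn_mask.repeat_interleave(T, dim=-1)  # (B, h*w, S*T)

        def split_heads(x):
            return x.view(B, -1, self.num_heads, dim // self.num_heads).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))               # (B, H, N, d)
        scores = (q @ k.transpose(-2, -1)) * self.scale      # (B, H, h*w, S*T)
        scores = scores.masked_fill(~attn_mask[:, None], torch.finfo(scores.dtype).min)
        # Zero out masked keys after softmax so queries covered by no box
        # receive no subject contribution at all.
        attn = scores.softmax(dim=-1) * attn_mask[:, None].to(scores.dtype)
        out = attn @ v                                        # (B, H, h*w, d)
        out = out.transpose(1, 2).reshape(B, h * w, dim)
        return self.to_out(out)
```

One design point worth noting in this sketch: query positions that fall outside every subject's box produce a zero output from the subject branch, so those regions remain governed by the ordinary text cross-attention, which is consistent with the stated goal of preserving text control over the overall composition.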