MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation

6 May 2024 | KUAN-CHIEH (JACKSON) WANG, Snap Inc., USA; DANIIL OSTASHEV, Snap Inc., UK; YUWEI FANG, Snap Inc., USA; SERGEY TULYAKOV, Snap Inc., USA; KFIR ABERMAN, Snap Inc., USA
The paper introduces Mixture-of-Attention (MoA), a novel architecture for personalized image generation that enhances the capabilities of text-to-image diffusion models. MoA is designed to generate images with multiple subjects in a fixed context and composition without predefined layouts, while minimizing the intervention of personalized elements in the generation process. The architecture consists of two attention pathways: a personalized branch and a non-personalized prior branch. The prior branch retains the original model's capabilities, while the personalized branch learns to embed subjects into the generated layout and context. A routing mechanism manages the distribution of pixels across these branches, optimizing the blend of personalized and generic content. MoA enables high-quality, personalized images with diverse compositions and interactions, preserving the model's pre-existing capabilities and offering disentangled control over subject and context. The method is compatible with existing diffusion-based image generation and editing techniques, such as ControlNet and inversion methods, and demonstrates applications like subject swapping, morphing, and style transfer. The paper also discusses limitations, such as challenges with small faces and complex scenarios, and suggests future directions for further improvements.
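The routing mechanism can be pictured as a soft, per-pixel gate over two attention branches. The PyTorch sketch below illustrates this idea under stated assumptions: the class name MoALayer, the router module, and the use of nn.MultiheadAttention for both branches are illustrative placeholders, not the authors' implementation, which operates inside the attention layers of a pretrained diffusion model.

```python
import torch
import torch.nn as nn

class MoALayer(nn.Module):
    """Illustrative sketch of a Mixture-of-Attention layer (not the authors' code).

    Two attention pathways process the same hidden states: a frozen
    'prior' branch (the original model's weights) and a trainable
    'personalized' branch. A learned router assigns each pixel (token)
    a soft weight that blends the two outputs.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.prior_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.personal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Freeze the prior branch so the original model's behavior is retained.
        for p in self.prior_attn.parameters():
            p.requires_grad = False
        # Per-token router: maps each hidden state to a scalar blend weight in [0, 1].
        self.router = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, hidden: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Cross-attention: image tokens (queries) attend to text/subject embeddings.
        prior_out, _ = self.prior_attn(hidden, context, context)
        personal_out, _ = self.personal_attn(hidden, context, context)
        w = self.router(hidden)  # shape (batch, tokens, 1)
        # Soft per-pixel mixture of the personalized and prior pathways.
        return w * personal_out + (1.0 - w) * prior_out
```

Per the paper's description, the router is trained so that personalized content is injected only where needed, with the remaining pixels handled by the frozen prior branch; this is what preserves the pretrained model's layout and context. The sketch above captures only the blending step, not the training objective.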