MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation

6 May 2024 | KUAN-CHIEH (JACKSON) WANG, DANIIL OSTASHEV, YUWEI FANG, SERGEY TULYAKOV, KFIR ABERMAN
This paper introduces Mixture-of-Attention (MoA), a new architecture for personalized image generation that enables subject-context disentanglement. Inspired by the Mixture-of-Experts mechanism used in large language models (LLMs), MoA extends the vanilla attention mechanism into multiple attention blocks (i.e., experts) that a router network softly combines, distributing the generation between a personalized and a non-personalized attention pathway.

MoA retains the original model's prior by fixing the attention layers in the prior (non-personalized) branch, while intervening minimally in the generation process through the personalized branch. The personalized branch learns to embed the subjects depicted in input images, via encoded visual tokens, into the layout and context generated by the prior branch. This is enabled by the router, which blends the output of the personalized branch only at the subject (i.e., foreground) pixels: it learns soft segmentation maps that dictate how the workload is distributed between the two branches at each layer. This routing frees the model from the usual trade-off between identity preservation and prompt consistency.

Once trained, MoA facilitates the creation of high-quality, personalized images featuring multiple subjects, with compositions and interactions as diverse as those generated by the original model. Crucially, MoA sharpens the distinction between the model's pre-existing capabilities and the newly added personalized intervention, offering a degree of disentangled subject-context control that was previously unattainable and enabling applications such as subject swap, subject morphing, and style transfer.
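To make the routing concrete, here is a minimal PyTorch sketch of such a layer. It is illustrative only, not the authors' implementation: the attention modules are assumed to accept a `context` keyword argument, and names like `MoALayer`, `prior_attn`, `personalized_attn`, and `subject_tokens` are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class MoALayer(nn.Module):
    """Sketch of a Mixture-of-Attention layer (illustrative, not the authors'
    code). A frozen 'prior' attention expert preserves the pretrained model's
    behavior; a trainable 'personalized' expert also attends to encoded
    subject tokens. A router predicts per-pixel soft weights (a learned soft
    segmentation map) that blend the two expert outputs."""

    def __init__(self, prior_attn: nn.Module, personalized_attn: nn.Module, dim: int):
        super().__init__()
        self.prior_attn = prior_attn                  # frozen: retains the prior
        self.personalized_attn = personalized_attn    # trainable: injects subjects
        for p in self.prior_attn.parameters():
            p.requires_grad_(False)
        # Router: maps each latent pixel's feature vector to 2 expert logits.
        self.router = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 2)
        )

    def forward(self, x, text_tokens, subject_tokens):
        # x: (batch, num_pixels, dim) latent features inside the diffusion U-Net.
        out_prior = self.prior_attn(x, context=text_tokens)
        # The personalized expert sees text tokens plus encoded subject tokens.
        out_pers = self.personalized_attn(
            x, context=torch.cat([text_tokens, subject_tokens], dim=1)
        )
        # Soft per-pixel routing: foreground (subject) pixels are expected to
        # weight the personalized expert, background pixels the prior expert.
        w = self.router(x).softmax(dim=-1)            # (batch, num_pixels, 2)
        return w[..., 0:1] * out_prior + w[..., 1:2] * out_pers
```

Because the prior expert is frozen and the router can send background pixels entirely through it, the pretrained model's layout and context generation are left intact wherever no subject appears.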
Because MoA makes only a minimal modification to the base diffusion model and keeps the prior branch fixed, it is compatible with many other diffusion-based image generation and editing techniques, such as ControlNet. MoA can also create new characters by interpolating between the image features of different subjects, which the authors refer to as subject morphing (see the sketch below). Beyond generation, MoA is compatible with real-image editing techniques based on diffusion inversion; used in conjunction with DDIM Inversion, it unlocks a novel approach to easily replace subjects in a real image. MoA is also capable of handling occlusion from objects.
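As a rough illustration of subject morphing, one can linearly interpolate the encoded image features of two subjects before injecting them into the personalized branch. The helpers `encode_subject` and `generate` below are hypothetical stand-ins for the subject image encoder and the MoA-equipped pipeline.

```python
import torch

def morph_subjects(feat_a: torch.Tensor, feat_b: torch.Tensor, alpha: float) -> torch.Tensor:
    """Linearly interpolate the encoded image features of two subjects.
    alpha=0.0 reproduces subject A and alpha=1.0 reproduces subject B;
    intermediate values yield a morphed identity when the result is fed
    to the personalized attention branch."""
    return (1.0 - alpha) * feat_a + alpha * feat_b

# Hypothetical usage (encode_subject and generate are assumed interfaces):
# feat_a, feat_b = encode_subject(img_a), encode_subject(img_b)
# image = generate(prompt, subject_tokens=morph_subjects(feat_a, feat_b, 0.5))
```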