Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation

25 Mar 2024 | Omer Dahary, Or Patashnik, Kfir Aberman, Daniel Cohen-Or
Bounded Attention is a training-free method for controlling multi-subject text-to-image generation by bounding the information flow in the model's attention layers. It addresses semantic leakage in diffusion models: attention layers blend visual features across subjects, so distinct subjects in the prompt bleed into one another and the generated image misrepresents them. By bounding attention during the denoising process, each subject retains its individuality, which makes the method effective for complex layouts with multiple similar subjects, such as several kittens of different colors, where feature leakage is most pronounced.

The method operates in two modes. Bounded Guidance minimizes a loss that encourages each subject's attention to stay within its bounding box, steering the latent during denoising. Bounded Denoising masks the attention maps themselves, confining each subject's attention to its own bounding box and the background.

Evaluated on Stable Diffusion and SDXL, Bounded Attention generates more accurate and semantically aligned images than existing methods, reducing semantic leakage so that each subject keeps its distinct characteristics even in complex layouts. User studies and quantitative evaluations further show significant improvements in counting accuracy and spatial alignment, making Bounded Attention a robust solution for precise multi-subject text-to-image generation.
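The Bounded Denoising mode can be illustrated with a minimal self-attention sketch. This is a simplified illustration, not the paper's implementation: the function name, tensor shapes, and the rule that background tokens attend everywhere are assumptions made here for clarity. The core idea it demonstrates is that each image token inside a subject's bounding box may attend only to tokens in the same box or in the background, which blocks cross-subject feature leakage.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bounded_self_attention(q, k, v, subject_ids):
    """Self-attention with a bounding-box mask (illustrative sketch).

    q, k, v:      (n, d) arrays of flattened image tokens.
    subject_ids:  (n,) int labels; 0 = background, 1..S = subject boxes.

    A token labeled with subject s may attend only to tokens labeled s
    or background; background tokens are unrestricted (an assumption of
    this sketch).
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)

    same = subject_ids[:, None] == subject_ids[None, :]   # same box
    to_bg = subject_ids[None, :] == 0                     # key is background
    from_bg = subject_ids[:, None] == 0                   # query is background
    allowed = same | to_bg | from_bg

    # Mask disallowed pairs before the softmax so their weight is exactly 0.
    scores = np.where(allowed, scores, -np.inf)
    return softmax(scores, axis=-1) @ v
```

With uniform queries and keys, each subject token's output averages only its own box and the background, never the other subject's tokens; in an actual diffusion model the same mask would be applied inside the U-Net's attention layers at each denoising step.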