Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation

25 Mar 2024 | Omer Dahary, Or Patashnik, Kfir Aberman, Daniel Cohen-Or
The paper "Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation" addresses the challenge of generating images with multiple subjects, especially when these subjects are semantically or visually similar. The authors identify that the primary issue lies in the semantic leakage between subjects during the denoising process, which is caused by the diffusion model's attention layers. To mitigate this problem, they introduce Bounded Attention, a training-free method that controls the information flow in the sampling process, preventing semantic leakage and ensuring each subject retains its distinct characteristics. The method operates in two modes: Bounded Guidance and Bounded Denoising. Bounded Guidance uses a loss function to steer the latent signal towards the desired layout, while Bounded Denoising applies masks to reduce semantic leakage and prevent unintended semantics from leaking to the background. The authors demonstrate that their method effectively generates images with multiple subjects, maintaining their individuality and aligning with the input prompt and layout. Extensive experiments on the DrawBench dataset and user studies show that Bounded Attention outperforms existing methods in terms of semantic alignment and image quality. The paper also includes a detailed analysis of the root causes of semantic leakage, ablation studies, and comparisons with other training-free and trained methods. The results highlight the effectiveness of Bounded Attention in generating complex and diverse images with multiple subjects, even in challenging scenarios.The paper "Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation" addresses the challenge of generating images with multiple subjects, especially when these subjects are semantically or visually similar. The authors identify that the primary issue lies in the semantic leakage between subjects during the denoising process, which is caused by the diffusion model's attention layers. To mitigate this problem, they introduce Bounded Attention, a training-free method that controls the information flow in the sampling process, preventing semantic leakage and ensuring each subject retains its distinct characteristics. The method operates in two modes: Bounded Guidance and Bounded Denoising. Bounded Guidance uses a loss function to steer the latent signal towards the desired layout, while Bounded Denoising applies masks to reduce semantic leakage and prevent unintended semantics from leaking to the background. The authors demonstrate that their method effectively generates images with multiple subjects, maintaining their individuality and aligning with the input prompt and layout. Extensive experiments on the DrawBench dataset and user studies show that Bounded Attention outperforms existing methods in terms of semantic alignment and image quality. The paper also includes a detailed analysis of the root causes of semantic leakage, ablation studies, and comparisons with other training-free and trained methods. The results highlight the effectiveness of Bounded Attention in generating complex and diverse images with multiple subjects, even in challenging scenarios.