This paper proposes INITNO, a method for improving text-to-image diffusion models by optimizing the initial noise. The core idea is to refine the initial noise in the latent space so that the generated image aligns with the text prompt. Using cross-attention and self-attention maps to evaluate a given initial noise, the method partitions the initial latent space into valid regions, which contain noise that produces semantically accurate images, and invalid regions, which do not. A noise optimization pipeline then guides the initial noise toward a valid region.
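To make the accept-or-optimize decision concrete, the following is a minimal sketch in PyTorch. The threshold values, the helper name is_valid_noise, the placeholder score values, and the latent shape are illustrative assumptions, not values taken from the paper.

```python
# A minimal sketch of the accept-or-optimize decision made on a sampled
# initial latent. Thresholds and score values here are placeholders.
import torch

def is_valid_noise(cross_score: float, self_score: float,
                   tau_cross: float = 0.2, tau_self: float = 0.3) -> bool:
    """A latent is treated as 'valid' when both scores fall below their thresholds."""
    return cross_score < tau_cross and self_score < tau_self

# Usage: sample an initial latent for a 512x512 Stable Diffusion run
# (4 channels, 64x64 spatial) and decide whether it needs optimization.
latent = torch.randn(1, 4, 64, 64)
cross_score, self_score = 0.35, 0.10  # would come from a denoising step's attention maps
if not is_valid_noise(cross_score, self_score):
    print("latent falls in an invalid region -> run noise optimization")
```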
The paper introduces two scores: a cross-attention response score and a self-attention conflict score. The cross-attention response score measures how strongly the prompt's subjects are expressed in the cross-attention maps, and hence how well the generated image will follow the text, while the self-attention conflict score detects overlap between the attention maps of different subjects, which tends to cause subject mixing. Thresholds on these two scores partition the initial latent space into valid and invalid regions.
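The sketch below shows one plausible way to compute both scores from attention maps collected at a single denoising step. The tensor shapes, the peak-pixel self-attention lookup, and the overlap measure are assumptions made for illustration; any smoothing or normalization the paper applies to the attention maps is omitted.

```python
# A sketch of the two scores on synthetic attention maps.
import torch

def cross_attention_response_score(cross_attn: torch.Tensor,
                                   subject_token_ids: list[int]) -> torch.Tensor:
    """cross_attn: (H*W, num_tokens) cross-attention from one denoising step.
    The score is low when every subject token attains a strong spatial response."""
    responses = torch.stack([cross_attn[:, t].max() for t in subject_token_ids])
    return 1.0 - responses.min()

def self_attention_conflict_score(self_attn: torch.Tensor,
                                  cross_attn: torch.Tensor,
                                  subject_token_ids: list[int]) -> torch.Tensor:
    """self_attn: (H*W, H*W). For each subject, take the self-attention map at the
    pixel where its cross-attention peaks; the score measures how much the maps
    of different subjects overlap (high overlap -> likely subject mixing)."""
    maps = []
    for t in subject_token_ids:
        peak = cross_attn[:, t].argmax()
        m = self_attn[peak]
        maps.append(m / (m.sum() + 1e-8))
    score = torch.zeros(())
    for i in range(len(maps)):
        for j in range(i + 1, len(maps)):
            score = score + torch.minimum(maps[i], maps[j]).sum()
    return score

# Usage with random maps at 16x16 attention resolution and two subject tokens.
hw, n_tokens = 16 * 16, 77
cross_attn = torch.rand(hw, n_tokens).softmax(dim=-1)
self_attn = torch.rand(hw, hw).softmax(dim=-1)
print(cross_attention_response_score(cross_attn, [2, 5]),
      self_attention_conflict_score(self_attn, cross_attn, [2, 5]))
```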
The noise optimization pipeline adds a distribution alignment loss so that the optimized noise stays close to the initial noise distribution. This avoids the trade-off between under-optimization and over-optimization and leads to more accurate and realistic image generation. Experiments show improved performance over existing approaches, effectiveness on complex prompts, and that the method can be integrated into existing diffusion models for training-free controllable generation, efficiently producing high-quality images that align well with the text prompts.
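Below is a minimal sketch of such an optimization loop, assuming the noise is reparameterized as mu + sigma * eps and the distribution alignment loss is a KL divergence to the standard normal. The score_fn argument stands in for the differentiable attention scores returned by one denoising step, and the step count, learning rate, and loss weighting are illustrative assumptions rather than the paper's settings.

```python
# A sketch of noise optimization with a distribution alignment term.
import torch

def distribution_alignment_loss(mu: torch.Tensor, log_sigma: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, sigma^2) || N(0, I) ), averaged over latent elements."""
    return 0.5 * (mu.pow(2) + (2 * log_sigma).exp() - 2 * log_sigma - 1.0).mean()

def optimize_initial_noise(score_fn, shape=(1, 4, 64, 64), steps=50,
                           lr=1e-2, kl_weight=1.0):
    mu = torch.zeros(shape, requires_grad=True)
    log_sigma = torch.zeros(shape, requires_grad=True)
    eps = torch.randn(shape)                       # fixed base sample
    opt = torch.optim.Adam([mu, log_sigma], lr=lr)
    for _ in range(steps):
        noise = mu + log_sigma.exp() * eps         # reparameterized latent
        cross_score, self_score = score_fn(noise)  # attention scores (differentiable)
        loss = (cross_score + self_score
                + kl_weight * distribution_alignment_loss(mu, log_sigma))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (mu + log_sigma.exp() * eps).detach()

# Usage with a dummy differentiable score function; in practice the scores come
# from the U-Net's attention maps at the first denoising step.
dummy_scores = lambda z: (z.mean().abs(), z.var())
optimized_latent = optimize_initial_noise(dummy_scores, steps=10)
```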