21 Mar 2024 | Pablo Marcos-Manchón, Roberto Alcover-Couso, Juan C. SanMiguel, José M. Martínez
This paper introduces Open-Vocabulary Attention Maps (OVAM), a training-free method for text-to-image diffusion models that enables the generation of attention maps for any word. The method allows for the creation of semantic segmentation masks based on open-vocabulary descriptions, regardless of the words in the text prompts used for image generation. Additionally, the paper proposes a lightweight optimization process based on OVAM for finding tokens that generate accurate attention maps for an object class with a single annotation.
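To illustrate the core idea, the sketch below is a minimal, hypothetical example (not the OVAM implementation): it assumes access to the image-feature queries of one cross-attention layer saved during generation, together with that layer's key projection weights, and shows how keys built from an arbitrary attribution prompt yield one spatial heat map per open-vocabulary token.

```python
import torch
import torch.nn.functional as F

def open_vocabulary_attention_map(image_queries, attribution_embeddings, w_k, head_dim):
    """Hypothetical sketch of an open-vocabulary attention map.

    image_queries: (n_pixels, d) queries projected from U-Net image features,
        stored during generation for one cross-attention layer and denoising step.
    attribution_embeddings: (n_tokens, d_text) text-encoder embeddings of an
        arbitrary attribution prompt (not necessarily the generation prompt).
    w_k: (d_text, d) key projection of the same cross-attention layer.
    Returns one spatial map per token, shape (n_tokens, n_pixels).
    """
    keys = attribution_embeddings @ w_k                 # project open-vocabulary tokens to keys
    scores = image_queries @ keys.T / head_dim ** 0.5   # scaled dot-product attention logits
    attn = F.softmax(scores, dim=-1)                    # normalize over the text tokens per pixel
    return attn.T                                       # reshape to (h, w) downstream for a heat map
```

Because the attribution prompt is decoupled from the generation prompt, the same stored queries can be re-queried with any word, which is what makes the maps open-vocabulary.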
OVAM is evaluated within existing state-of-the-art Stable Diffusion extensions. The best-performing model raises the mIoU of its pseudo-masks for synthetic images from 52.1 to 86.6, demonstrating that optimized tokens are an efficient way to improve existing methods without architectural changes or retraining. The implementation is available at github.com/vpulab/ovam.
The paper also presents a detailed methodology for OVAM, covering the cross-attention formulation, open-vocabulary attention maps, token optimization via OVAM, and mask binarization. The results show that OVAM with token optimization outperforms models that require additional training, even though it relies on only a single annotation per class. Grounded Diffusion's performance is noteworthy but skewed by weak results in several classes, an issue examined in a subsequent experiment.
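The token-optimization step can be sketched as follows. This is a hedged illustration, not the paper's exact procedure: the map function, the binary cross-entropy loss, and the hyperparameters are assumptions chosen to show how a single annotated mask can supervise a learnable text embedding.

```python
import torch
import torch.nn.functional as F

def optimize_token(ovam_map_fn, init_embedding, target_mask, steps=100, lr=1e-2):
    """Hypothetical sketch of token optimization from one annotation.

    ovam_map_fn: differentiable callable mapping a token embedding (d_text,)
        to a predicted attention map (H, W), assumed normalized to [0, 1].
    init_embedding: initial embedding of the class word (e.g. from the text encoder).
    target_mask: (H, W) binary mask of the single annotated image.
    """
    token = init_embedding.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([token], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        pred = ovam_map_fn(token)                          # attention map for the candidate token
        loss = F.binary_cross_entropy(pred, target_mask)   # match the single annotation
        loss.backward()
        optimizer.step()
    return token.detach()                                  # optimized token, reusable on new images
```

Once optimized, the token replaces the plain class word when querying attention maps, which is why downstream methods can benefit without any retraining.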
The paper also presents an ablation study on the impact of different components of OVAM, including post-processing effects, layer selection, and time-step selection. The results show that aggregating across all time steps yields the best performance. Moreover, when token optimization is used, a similar level of performance can be achieved by extracting attentions only at t = 12, the midpoint of the diffusion process.
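The aggregation setting from the ablation can be sketched generically as below; it assumes per-layer, per-step attention maps are available, and uses bilinear upsampling plus averaging as a stand-in for whatever aggregation the actual implementation applies.

```python
import torch
import torch.nn.functional as F

def aggregate_attention(maps_per_step_and_layer, size=(64, 64)):
    """Sketch of aggregating maps across denoising steps and U-Net layers.

    maps_per_step_and_layer: list of tensors, one per (step, layer) pair,
        each of shape (n_tokens, h, w) at that layer's native resolution.
    """
    resized = [
        F.interpolate(m.unsqueeze(0), size=size, mode="bilinear",
                      align_corners=False).squeeze(0)   # bring every map to a common resolution
        for m in maps_per_step_and_layer
    ]
    return torch.stack(resized).mean(dim=0)              # (n_tokens, H, W) aggregated attention
```

Restricting the input list to the maps from a single step (e.g. t = 12) reproduces the cheaper single-step setting discussed above.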
The paper concludes that OVAM not only enhances existing diffusion-based segmentation methods but also serves as a valuable approach for generating synthetic data to train robust semantic segmentation models.