21 Mar 2024 | Pablo Marcos-Manchón, Roberto Alcover-Couso, Juan C. SanMiguel, José M. Martínez
This paper introduces Open-Vocabulary Attention Maps (OVAM), a training-free method for text-to-image diffusion models that enables the generation of attention maps for any word. The method allows for the creation of semantic segmentation masks based on open-vocabulary descriptions, regardless of the words in the text prompts used for image generation. Additionally, the paper proposes a lightweight optimization process based on OVAM for finding tokens that generate accurate attention maps for an object class with a single annotation.
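To illustrate the core idea, the sketch below is a minimal, hypothetical example (not the OVAM implementation): it assumes access to the image-feature queries of one cross-attention layer saved during generation, together with that layer's key projection weights, and shows how keys built from an arbitrary attribution prompt yield one spatial heat map per open-vocabulary token.

```python
import torch
import torch.nn.functional as F

def open_vocabulary_attention_map(image_queries, attribution_embeddings, w_k, head_dim):
    """Hypothetical sketch of an open-vocabulary attention map.

    image_queries: (n_pixels, d) queries projected from U-Net image features,
        stored during generation for one cross-attention layer and denoising step.
    attribution_embeddings: (n_tokens, d_text) text-encoder embeddings of an
        arbitrary attribution prompt (not necessarily the generation prompt).
    w_k: (d_text, d) key projection of the same cross-attention layer.
    Returns one spatial map per token, shape (n_tokens, n_pixels).
    """
    keys = attribution_embeddings @ w_k                 # project open-vocabulary tokens to keys
    scores = image_queries @ keys.T / head_dim ** 0.5   # scaled dot-product attention logits
    attn = F.softmax(scores, dim=-1)                    # normalize over the text tokens per pixel
    return attn.T                                       # reshape to (h, w) downstream for a heat map
```

Because the attribution prompt is decoupled from the generation prompt, the same stored queries can be re-queried with any word, which is what makes the maps open-vocabulary.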
OVAM is evaluated within existing state-of-the-art Stable Diffusion extensions. The best-performing model raises the mIoU of its pseudo-masks for synthetic images from 52.1 to 86.6, demonstrating that optimized tokens are an efficient way to improve existing methods without architectural changes or retraining. The implementation is available at github.com/vpulab/ovam.
The paper also presents a detailed methodology for OVAM, covering the cross-attention formulation, open-vocabulary attention maps, token optimization via OVAM, and mask binarization. The results show that OVAM with token optimization outperforms models that require additional training, even though it relies on only a single annotation per class. Grounded Diffusion's performance is noteworthy but skewed by weak results in several classes, an issue examined in a subsequent experiment.
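The token-optimization step can be sketched as follows. This is a hedged illustration, not the paper's exact procedure: the map function, the binary cross-entropy loss, and the hyperparameters are assumptions chosen to show how a single annotated mask can supervise a learnable text embedding.

```python
import torch
import torch.nn.functional as F

def optimize_token(ovam_map_fn, init_embedding, target_mask, steps=100, lr=1e-2):
    """Hypothetical sketch of token optimization from one annotation.

    ovam_map_fn: differentiable callable mapping a token embedding (d_text,)
        to a predicted attention map (H, W), assumed normalized to [0, 1].
    init_embedding: initial embedding of the class word (e.g. from the text encoder).
    target_mask: (H, W) binary mask of the single annotated image.
    """
    token = init_embedding.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([token], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        pred = ovam_map_fn(token)                          # attention map for the candidate token
        loss = F.binary_cross_entropy(pred, target_mask)   # match the single annotation
        loss.backward()
        optimizer.step()
    return token.detach()                                  # optimized token, reusable on new images
```

Once optimized, the token replaces the plain class word when querying attention maps, which is why downstream methods can benefit without any retraining.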
The paper also presents an ablation study on the impact of different components of OVAM, including post-processing effects, layer selection, and time-step selection. The results show that aggregating across all time steps yields the best performance. Moreover, when token optimization is used, a similar level of performance can be achieved by extracting attentions only at t = 12, the midpoint of the diffusion process.
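The aggregation setting from the ablation can be sketched generically as below; it assumes per-layer, per-step attention maps are available, and uses bilinear upsampling plus averaging as a stand-in for whatever aggregation the actual implementation applies.

```python
import torch
import torch.nn.functional as F

def aggregate_attention(maps_per_step_and_layer, size=(64, 64)):
    """Sketch of aggregating maps across denoising steps and U-Net layers.

    maps_per_step_and_layer: list of tensors, one per (step, layer) pair,
        each of shape (n_tokens, h, w) at that layer's native resolution.
    """
    resized = [
        F.interpolate(m.unsqueeze(0), size=size, mode="bilinear",
                      align_corners=False).squeeze(0)   # bring every map to a common resolution
        for m in maps_per_step_and_layer
    ]
    return torch.stack(resized).mean(dim=0)              # (n_tokens, H, W) aggregated attention
```

Restricting the input list to the maps from a single step (e.g. t = 12) reproduces the cheaper single-step setting discussed above.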
The paper concludes that OVAM not only enhances existing diffusion-based segmentation methods but also serves as a valuable approach for generating synthetic data to train robust semantic segmentation models.