EMERDIFF: EMERGING PIXEL-LEVEL SEMANTIC KNOWLEDGE IN DIFFUSION MODELS


2024 | Koichi Namekata, Amirmojtaba Sabour, Sanja Fidler, Seung Wook Kim
This paper presents EmerDiff, an unsupervised image segmentor that generates fine-grained segmentation maps using only the semantic knowledge embedded in a pre-trained diffusion model. The key challenge is that semantically meaningful feature maps live in the spatially low-dimensional layers of diffusion models, which makes it difficult to extract pixel-level semantic relations from them directly. To address this, the framework identifies semantic correspondences between individual image pixels and spatial locations of the low-dimensional feature maps by exploiting the diffusion model's generation process. It first produces low-resolution segmentation maps by applying k-means to the low-dimensional feature maps, then builds image-resolution segmentation maps by assigning each pixel to its most semantically corresponding low-resolution mask. Experiments on multiple scene-centric datasets validate the framework and demonstrate that diffusion models possess highly accurate pixel-level semantic knowledge.
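As a concrete illustration of the first step, the sketch below clusters a low-resolution feature map into segment masks with k-means. This is a minimal sketch, not the authors' code: the feature tensor is random stand-in data for features that would in practice be captured (for example, via a forward hook) from a spatially low-dimensional layer of Stable Diffusion's UNet, and the shape (1280, 16, 16) is an assumption.

    import numpy as np
    from sklearn.cluster import KMeans

    def low_res_masks(feature_map: np.ndarray, n_segments: int = 10) -> np.ndarray:
        """Cluster a (C, H, W) feature map into n_segments low-resolution masks.

        Each spatial location is treated as a C-dimensional feature vector, so
        k-means groups semantically similar locations into the same mask.
        """
        c, h, w = feature_map.shape
        vectors = feature_map.reshape(c, h * w).T                 # (H*W, C)
        labels = KMeans(n_clusters=n_segments, n_init=10).fit_predict(vectors)
        return labels.reshape(h, w)                               # (H, W) mask ids

    # Random stand-in for features that would be captured from a spatially
    # low-dimensional UNet layer of Stable Diffusion (the shape is an assumption).
    features = np.random.randn(1280, 16, 16).astype(np.float32)
    masks = low_res_masks(features, n_segments=10)
    print(masks.shape)  # (16, 16): one mask index per low-resolution location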
The framework is built on Stable Diffusion, a large-scale text-conditioned diffusion model capable of generating diverse high-resolution images. To establish the pixel-to-mask correspondences, the method modulates the feature values inside one sub-region of the low-resolution feature maps and observes how the generated image changes: the pixels whose appearance shifts the most are the ones semantically tied to that sub-region. The resulting segmentation maps are well-delineated and capture detailed parts of the images, indicating that accurate pixel-level semantic knowledge is indeed present in diffusion models. Evaluated qualitatively and quantitatively on standard segmentation benchmarks, the framework performs on par with recent DINO-based baselines and outperforms prior baselines on annotation-free open-vocabulary semantic segmentation. Because it requires no additional training or annotations, the method can also be plugged into existing annotation-free open-vocabulary segmentation models to produce class-aware fine-grained segmentation maps. The study highlights the potential of diffusion models for semantic segmentation tasks and encourages further research into applying generative models to discriminative tasks.
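The second step can be sketched as follows. The function below is a hypothetical illustration: it assumes a callable generate(k, offset) that re-runs the denoising process with the feature values inside low-resolution mask k shifted by the given offset and returns the rendered image. Neither the callable, the function name pixel_to_mask, nor the offset value comes from the paper's code.

    import numpy as np
    from typing import Callable

    def pixel_to_mask(
        generate: Callable[[int, float], np.ndarray],  # hypothetical generator
        n_segments: int,
        offset: float = 10.0,
    ) -> np.ndarray:
        """Assign each image pixel to the low-resolution mask it responds to.

        For every mask k, generation is re-run twice with the feature values
        inside mask k shifted by +offset and -offset; the pixels that change
        the most between the two renders are semantically tied to mask k.
        """
        diffs = []
        for k in range(n_segments):
            img_plus = generate(k, +offset)    # (H, W, 3) modulated render
            img_minus = generate(k, -offset)   # (H, W, 3) modulated render
            diffs.append(np.abs(img_plus - img_minus).sum(axis=-1))
        # Each pixel is assigned to the mask whose modulation changed it most.
        return np.argmax(np.stack(diffs, axis=0), axis=0)        # (H, W)

In this sketch, differencing the +offset and -offset renders isolates the effect of modulating mask k, since image content that does not depend on that sub-region is approximately identical in both renders and cancels out.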