EMERDIFF: EMERGING PIXEL-LEVEL SEMANTIC KNOWLEDGE IN DIFFUSION MODELS


2024 | Koichi Namekata, Amirmojtaba Sabour, Sanja Fidler, Seung Wook Kim
This paper presents EmerDiff, an unsupervised image segmentor that generates fine-grained segmentation maps using only the semantic knowledge embedded in a pre-trained diffusion model. The key challenge is that semantically meaningful feature maps live in the spatially low-dimensional layers of diffusion models, which makes it difficult to extract pixel-level semantic relations from them directly. To address this, the framework identifies semantic correspondences between individual image pixels and spatial locations of the low-dimensional feature maps by exploiting the diffusion model's generation process. It first produces low-resolution segmentation maps by applying k-means to the low-dimensional feature maps, then builds image-resolution segmentation maps by assigning each pixel to its most semantically corresponding low-resolution mask. Experiments on multiple scene-centric datasets validate the framework and demonstrate that diffusion models possess highly accurate pixel-level semantic knowledge.
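As a concrete illustration of the first step, the sketch below clusters a low-resolution feature map into segment masks with k-means. This is a minimal sketch, not the authors' code: the feature tensor is random stand-in data for features that would in practice be captured (for example, via a forward hook) from a spatially low-dimensional layer of Stable Diffusion's UNet, and the shape (1280, 16, 16) is an assumption.

    import numpy as np
    from sklearn.cluster import KMeans

    def low_res_masks(feature_map: np.ndarray, n_segments: int = 10) -> np.ndarray:
        """Cluster a (C, H, W) feature map into n_segments low-resolution masks.

        Each spatial location is treated as a C-dimensional feature vector, so
        k-means groups semantically similar locations into the same mask.
        """
        c, h, w = feature_map.shape
        vectors = feature_map.reshape(c, h * w).T                 # (H*W, C)
        labels = KMeans(n_clusters=n_segments, n_init=10).fit_predict(vectors)
        return labels.reshape(h, w)                               # (H, W) mask ids

    # Random stand-in for features that would be captured from a spatially
    # low-dimensional UNet layer of Stable Diffusion (the shape is an assumption).
    features = np.random.randn(1280, 16, 16).astype(np.float32)
    masks = low_res_masks(features, n_segments=10)
    print(masks.shape)  # (16, 16): one mask index per low-resolution location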
The framework is built on Stable Diffusion, a large-scale text-conditioned diffusion model capable of generating diverse high-resolution images. To establish the pixel-to-mask correspondences, the method modulates the feature values inside one sub-region of the low-resolution feature maps and observes how the generated image changes: the pixels whose appearance shifts the most are the ones semantically tied to that sub-region. The resulting segmentation maps are well-delineated and capture detailed parts of the images, indicating that accurate pixel-level semantic knowledge is indeed present in diffusion models. Evaluated qualitatively and quantitatively on standard segmentation benchmarks, the framework performs on par with recent DINO-based baselines and outperforms prior baselines on annotation-free open-vocabulary semantic segmentation. Because it requires no additional training or annotations, the method can also be plugged into existing annotation-free open-vocabulary segmentation models to produce class-aware fine-grained segmentation maps. The study highlights the potential of diffusion models for semantic segmentation tasks and encourages further research into applying generative models to discriminative tasks.
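The second step can be sketched as follows. The function below is a hypothetical illustration: it assumes a callable generate(k, offset) that re-runs the denoising process with the feature values inside low-resolution mask k shifted by the given offset and returns the rendered image. Neither the callable, the function name pixel_to_mask, nor the offset value comes from the paper's code.

    import numpy as np
    from typing import Callable

    def pixel_to_mask(
        generate: Callable[[int, float], np.ndarray],  # hypothetical generator
        n_segments: int,
        offset: float = 10.0,
    ) -> np.ndarray:
        """Assign each image pixel to the low-resolution mask it responds to.

        For every mask k, generation is re-run twice with the feature values
        inside mask k shifted by +offset and -offset; the pixels that change
        the most between the two renders are semantically tied to mask k.
        """
        diffs = []
        for k in range(n_segments):
            img_plus = generate(k, +offset)    # (H, W, 3) modulated render
            img_minus = generate(k, -offset)   # (H, W, 3) modulated render
            diffs.append(np.abs(img_plus - img_minus).sum(axis=-1))
        # Each pixel is assigned to the mask whose modulation changed it most.
        return np.argmax(np.stack(diffs, axis=0), axis=0)        # (H, W)

In this sketch, differencing the +offset and -offset renders isolates the effect of modulating mask k, since image content that does not depend on that sub-region is approximately identical in both renders and cancels out.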