EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models

1University of Toronto, 2Vector Institute, 3NVIDIA
ICLR, 2024

EmerDiff is an unsupervised image segmentor built solely on the semantic knowledge extracted from a pre-trained diffusion model. The resulting fine-grained segmentation maps suggest the presence of highly accurate pixel-level semantic knowledge in diffusion models.

Abstract

Diffusion models have recently received increasing research attention for their remarkable transfer abilities in semantic segmentation tasks. However, generating fine-grained segmentation masks with diffusion models often requires additional training on annotated datasets, leaving it unclear to what extent pre-trained diffusion models alone understand the semantic relations of their generated images. To address this question, we leverage the semantic knowledge extracted from Stable Diffusion (SD) and aim to develop an image segmentor capable of generating fine-grained segmentation maps without any additional training. The primary difficulty stems from the fact that semantically meaningful feature maps typically exist only in the spatially lower-dimensional layers, which poses a challenge in directly extracting pixel-level semantic relations from these feature maps. To overcome this issue, our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps by exploiting SD's generation process and utilizes them for constructing image-resolution segmentation maps. In extensive experiments, the produced segmentation maps are demonstrated to be well delineated and capture detailed parts of the images, indicating the existence of highly accurate pixel-level semantic knowledge in diffusion models.

Key observation

We begin by investigating how a local change in the values of low-resolution feature maps (e.g. 16×16) influences the pixel values of the generated images (e.g. 512×512). We discover that when we perturb the values of a sub-region of the low-resolution feature maps (middle row of the figure below), the generated images change in such a way that only the pixels semantically related to that sub-region are notably altered (bottom row of the figure).


Observation. First row: original image. Second row: local change in low-resolution feature maps (e.g. 16×16). Third row: resultant change in final generated images (e.g. 512×512).


Following the above observation, we can automatically identify the semantic correspondences between image pixels and a sub-region of low-dimensional feature maps by simply measuring the change in the pixel values.
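As an illustration, this correspondence extraction can be sketched with plain NumPy. The random arrays below merely stand in for Stable Diffusion outputs, and all names and shapes (`original`, `modulated`, 64×64, 4 masks) are illustrative placeholders, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

H, W = 64, 64   # image resolution (the paper uses 512x512)
K = 4           # number of low-resolution sub-regions (masks)

# Stand-ins for generated images: `original` is the unmodulated output,
# `modulated[k]` is the output when the k-th sub-region of the
# low-resolution feature maps is perturbed. In the real pipeline these
# come from Stable Diffusion's denoising process.
original = rng.random((H, W, 3))
modulated = rng.random((K, H, W, 3))

# Per-pixel change caused by perturbing each sub-region: d[k, y, x].
d = np.abs(modulated - original[None]).sum(axis=-1)

# Each pixel is matched to the sub-region whose perturbation changed it
# the most, giving an image-resolution assignment map.
segmentation = d.argmax(axis=0)  # shape (H, W), values in [0, K)
```

The key point is that the semantic correspondence is read off purely from pixel-value differences, with no learned decoder or extra training.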

Pipeline

EmerDiff generates fine-grained segmentation maps in an unsupervised manner. First, we generate low-resolution segmentation maps (e.g. 16×16) by applying k-means to the low-dimensional feature maps (green part of the figure below). Then, we build image-resolution segmentation maps (e.g. 512×512) in a top-down manner by mapping each image pixel to the most semantically corresponding low-resolution mask (orange part of the figure). These semantic correspondences are extracted from the diffusion model via the Modulated Denoising Process, leveraging the above observation (red part).
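The first (green) stage is a standard k-means clustering over spatial feature vectors. A minimal NumPy sketch follows; the random features stand in for the feature maps extracted from Stable Diffusion's UNet, and the dimensions and iteration count are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

h, w, c = 16, 16, 1280   # low-resolution feature map (dims are illustrative)
K = 4                    # number of low-resolution segments

# Stand-in for SD's low-dimensional feature maps; in the real pipeline
# these are taken from intermediate UNet layers during denoising.
X = rng.random((h * w, c))

# Plain k-means (Lloyd's algorithm) over the h*w spatial feature vectors.
centers = X[rng.choice(h * w, K, replace=False)]
for _ in range(20):
    labels = np.linalg.norm(X[:, None] - centers[None], axis=-1).argmin(axis=1)
    new_centers = []
    for k in range(K):
        pts = X[labels == k]
        # Keep the old center if a cluster happens to empty out.
        new_centers.append(pts.mean(axis=0) if len(pts) else centers[k])
    centers = np.stack(new_centers)

low_res_seg = labels.reshape(h, w)  # e.g. a 16x16 segmentation map
```

Each of the K clusters then serves as one low-resolution mask whose pixels are promoted to image resolution in the second (orange) stage.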


Results

Qualitative comparison with naively upsampled low-resolution segmentation maps. Our segmentation maps are fine-grained and precisely capture detailed parts of the objects.

Varying the number of segmentation masks. Our framework consistently groups objects in a semantically meaningful manner.

Unsupervised semantic segmentation

Here we apply EmerDiff to unsupervised semantic segmentation. For more analysis, please refer to the paper.


BibTeX


@article{namekata2024emerdiff,
  title={EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models},
  author={Koichi Namekata and Amirmojtaba Sabour and Sanja Fidler and Seung Wook Kim},
  journal={arXiv:2401.11739},
  year={2024}
}