2024-03-09 | Michael Toker, Hadas Orgad, Mor Ventura, Dana Arad, Yonatan Belinkov
This paper introduces the DIFFUSION LENS, a method for analyzing the text encoder in text-to-image (T2I) diffusion models. The text encoder converts text prompts into latent representations that guide image generation. The DIFFUSION LENS uses intermediate representations from various layers of the text encoder to generate images, revealing insights into how the encoder processes text. The method relies solely on pre-trained model weights and does not require external modules.
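The mechanism described above can be sketched with the Hugging Face diffusers and transformers libraries. This is a minimal illustration, not the paper's exact procedure: the function name and layer index are illustrative, and the choice to apply the encoder's final layer norm to the intermediate representation is an assumption (analogous to logit-lens-style methods for language models).

```python
import torch


def diffusion_lens_image(pipe, prompt: str, layer_idx: int):
    """Generate an image from the text encoder's hidden state at `layer_idx`.

    `pipe` is a diffusers StableDiffusionPipeline. Hypothetical sketch; the
    paper's exact procedure may differ.
    """
    tokenizer, encoder = pipe.tokenizer, pipe.text_encoder
    tokens = tokenizer(
        prompt,
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).to(encoder.device)
    with torch.no_grad():
        out = encoder(tokens.input_ids, output_hidden_states=True)
        # hidden_states[0] is the embedding-layer output;
        # hidden_states[k] is the output of transformer layer k.
        h = out.hidden_states[layer_idx]
        # Assumption: apply the encoder's final layer norm so the intermediate
        # representation matches the distribution the diffusion model expects.
        h = encoder.text_model.final_layer_norm(h)
    # Feed the intermediate representation in place of the final prompt embedding.
    return pipe(prompt_embeds=h).images[0]


# Usage (requires a model download and ideally a GPU):
# from diffusers import StableDiffusionPipeline
# pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# img = diffusion_lens_image(pipe, "a cat wearing a red hat", layer_idx=6)
```

Sweeping `layer_idx` from early to late layers would then visualize how the prompt's representation develops across the encoder, which is the core of the analysis the paper performs.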
The study analyzes two popular T2I models: Stable Diffusion (SD) and Deep Floyd (DF). The analysis focuses on two aspects: conceptual combination and memory retrieval. Conceptual combination refers to how the encoder combines multiple concepts to form a composite concept. The study finds that complex prompts require more layers of computation before a coherent representation emerges, and that relationships between objects appear only gradually in later layers. The order in which objects emerge is influenced by their linear or syntactic precedence in the prompt.
Memory retrieval concerns how the encoder retrieves and represents information about concepts. The study finds that common concepts emerge early in the encoder, while uncommon concepts emerge gradually across layers. Fine details, such as human facial features, appear only in later layers. Knowledge retrieval is thus gradual, with representations becoming more accurate as computation progresses. This contrasts with prior research suggesting that knowledge is localized in specific layers.
The study also identifies two types of failures in T2I models: (1) the model fails to combine concepts correctly, and (2) the model successfully combines concepts but fails to integrate them into a final image. These failures are attributed to biases in the model's representation of certain concepts.
The DIFFUSION LENS provides a new method for analyzing the text encoder in T2I models, revealing insights into how the encoder processes text and generates images. The findings suggest that factors such as architecture, pretraining objectives, and data may influence the encoding of knowledge or language representation within the models. The study contributes to a growing body of work on analyzing how models process information across various components. The DIFFUSION LENS has many potential applications, including improving model efficiency and tracing factual associations in language models.