9 Mar 2024 | Michael Toker, Hadas Orgad, Mor Ventura, Dana Arad, Yonatan Belinkov
The paper introduces the DIFFUSION LENS, a method for analyzing the text encoder in text-to-image (T2I) models by generating images from its intermediate representations. The method requires no additional training: hidden states from intermediate layers of the text encoder are fed directly into the pre-trained diffusion model, which produces clear and consistent images from them. The authors apply DIFFUSION LENS to two popular T2I models, Stable Diffusion and Deep Floyd, to explore two main aspects: conceptual combination and memory retrieval.
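To make the method concrete, here is a minimal sketch of the Diffusion Lens idea for Stable Diffusion, written against the HuggingFace diffusers library. The checkpoint, the layer indices, and the use of the text encoder's final layer norm on intermediate states are illustrative assumptions, not the authors' exact recipe.

```python
# Minimal Diffusion Lens sketch: decode intermediate text-encoder layers
# with the frozen diffusion model. Assumptions: SD v1.5 checkpoint, the
# layer picks, and applying the final layer norm to intermediate states.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a red backpack on a wooden chair"
input_ids = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
).input_ids.to("cuda")

with torch.no_grad():
    # hidden_states[0] is the token embedding; hidden_states[l] is the
    # output of encoder layer l (12 layers in SD v1.5's CLIP text encoder).
    hidden_states = pipe.text_encoder(
        input_ids, output_hidden_states=True
    ).hidden_states

for layer in (4, 8, 12):  # hypothetical early / middle / final layers
    h = hidden_states[layer]
    # Assumption: normalize intermediate states with the encoder's final
    # layer norm so they match what the diffusion model saw in training.
    h = pipe.text_encoder.text_model.final_layer_norm(h)
    image = pipe(prompt_embeds=h).images[0]
    image.save(f"diffusion_lens_layer_{layer}.png")
```

Sweeping the layer index this way is what lets one watch concepts and relations accumulate across the encoder's depth, which is exactly the lens the analyses below look through.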
For conceptual combination, the analysis reveals that complex scenes with multiple objects are composed progressively and more slowly than simple scenes. Early layers often act as a "bag of concepts," lacking relational information, while later layers capture these relationships more accurately. The order in which objects emerge is influenced by their linear or syntactic precedence, with Deep Floyd's T5 showing greater sensitivity to syntactic structure and Stable Diffusion's CLIP reflecting linear order.
For memory retrieval, the study finds that common concepts emerge early, while uncommon concepts surface gradually across layers, with the most accurate representations in the upper layers. Fine details, such as human facial features, are refined at later stages. Knowledge retrieval is thus gradual, and the two text encoders differ in their retrieval patterns: Deep Floyd's T5 builds up knowledge incrementally across layers, whereas Stable Diffusion's CLIP arrives at its representations more abruptly.
The paper also discusses model failures, particularly in generating images from prompts that describe two entities with different colors. Two failure modes emerge: either the representation couples only a particular concept with a color, or it successfully couples each concept with its color but fails to combine the two pairings into a single scene.
The authors conclude that DIFFUSION LENS provides valuable insights into the text encoder component in T2I pipelines, contributing to a deeper understanding of the entire generation process.