On Mechanistic Knowledge Localization in Text-to-Image Generative Models

2024 | Samyadeep Basu, Keivan Rezaei, Priyatham Kattakinda, Ryan Rossi, Cherry Zhao, Vlad Morariu, Varun Manjunatha, Soheil Feizi
This paper presents a method for mechanistic knowledge localization in text-to-image generative models, enabling efficient model editing. The authors investigate how knowledge about visual attributes (e.g., style, objects, facts) is localized within the UNet of various text-to-image models. They find that while causal tracing can identify localized knowledge in early Stable-Diffusion variants, it fails for newer models like SD-XL and DeepFloyd. To address this, they introduce LOCOGEN, a method that identifies the specific cross-attention layers in the UNet that control different visual attributes. LOCOGEN measures the direct effect of individual cross-attention layers on the generated output by performing targeted interventions on the text conditioning of those layers.

Using LOCOGEN, the authors then introduce LOCOEDIT, a fast closed-form editing method that works across popular open-source text-to-image models. They demonstrate that knowledge about various visual attributes can be localized to a small subset of layers in the UNet, enabling targeted edits such as removing artistic styles, modifying trademarked objects, and updating outdated facts. The authors also explore neuron-level model editing, showing that knowledge about specific styles can be localized to a few neurons, and that modifying these neurons can effectively remove the associated style from generated images.
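To make the layer-intervention idea concrete, below is a minimal sketch of a LOCOGEN-style search, not the authors' implementation. It assumes a diffusers `StableDiffusionPipeline` whose cross-attention modules follow the usual `attn2` naming, that those modules receive the text conditioning through the `encoder_hidden_states` keyword argument (recent diffusers versions), PyTorch >= 2.0 for keyword-aware hooks, and a `clip_score(image, text)` helper (e.g. CLIP image-text similarity); the prompts and the window size are likewise illustrative. For each contiguous window of cross-attention layers, the text conditioning of just those layers is swapped for the embedding of an attribute-free prompt, and the window whose intervention most weakens the attribute in the generated image is taken as the candidate locus.

```python
import torch
from diffusers import StableDiffusionPipeline  # assumed model family (SD v1/v2-style UNet)


def cross_attention_layers(unet):
    """Collect cross-attention modules in registration order.

    Assumes the diffusers convention that cross-attention blocks are the
    modules whose qualified name ends in 'attn2'.
    """
    return [m for name, m in unet.named_modules() if name.endswith("attn2")]


def embed(pipe, text):
    """Text-encoder embedding for a single prompt (shape [1, seq, dim])."""
    tokens = pipe.tokenizer(
        text, padding="max_length", truncation=True,
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    )
    return pipe.text_encoder(tokens.input_ids.to(pipe.device))[0]


def swap_conditioning_hook(replacement):
    """Forward pre-hook that substitutes one layer's text conditioning.

    Assumes the layer is called with `encoder_hidden_states=` as a keyword
    argument, and PyTorch >= 2.0 for `with_kwargs=True`.
    """
    def hook(module, args, kwargs):
        ehs = kwargs.get("encoder_hidden_states")
        if ehs is None:
            return args, kwargs
        rep = replacement.to(device=ehs.device, dtype=ehs.dtype)
        if ehs.shape[0] == 2 * rep.shape[0]:
            # Classifier-free guidance batches [uncond, cond]:
            # keep the unconditional half, swap only the conditional half.
            rep = torch.cat([ehs[: rep.shape[0]], rep], dim=0)
        new_kwargs = dict(kwargs)
        new_kwargs["encoder_hidden_states"] = rep
        return args, new_kwargs
    return hook


@torch.no_grad()
def locate_controlling_window(pipe, prompt, neutral_prompt, attribute_text,
                              clip_score, window=2, seed=0):
    """Return the start index of the cross-attention window whose
    intervention most weakens `attribute_text` in the generated image."""
    layers = cross_attention_layers(pipe.unet)
    neutral_emb = embed(pipe, neutral_prompt)
    scores = {}
    for start in range(len(layers) - window + 1):
        handles = [
            layers[start + i].register_forward_pre_hook(
                swap_conditioning_hook(neutral_emb), with_kwargs=True)
            for i in range(window)
        ]
        try:
            gen = torch.Generator(device=pipe.device).manual_seed(seed)
            image = pipe(prompt, generator=gen).images[0]
        finally:
            for h in handles:
                h.remove()
        scores[start] = clip_score(image, attribute_text)
    return min(scores, key=scores.get)  # lowest attribute score = strongest control
```

For example, `locate_controlling_window(pipe, "a house in the style of Van Gogh", "a house", "Van Gogh style", clip_score)` sweeps windows of two consecutive cross-attention layers; the prompt pair mirrors the style-localization use case described above.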
The paper highlights the importance of mechanistic localization for understanding and editing text-to-image models: where causal tracing fails to surface localized knowledge in newer architectures, intervention-based localization remains a more general and effective approach. The authors discuss the implications of these findings, emphasizing the potential for more precise and efficient model modifications, and their results across a range of open-source models point to a promising direction for future research in model interpretability and editing.
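Returning to the closed-form editing step mentioned above: the sketch below illustrates the kind of ridge-regression weight update used by methods in this family, applied only to the cross-attention layers identified by the localization step. It is a sketch under assumptions, not the paper's exact objective: the edit is assumed to target the `to_k`/`to_v` projection matrices, source and target prompt token embeddings are paired position-by-position, and the regularization strength `lam` is illustrative. The `embed` and `cross_attention_layers` helpers are the ones sketched earlier.

```python
import torch


@torch.no_grad()
def closed_form_edit(weight, source_embs, target_embs, lam=0.1):
    """Ridge-regression update of a projection matrix.

    Minimizes  sum_i ||W' c_i - W c_i*||^2 + lam * ||W' - W||_F^2,
    where c_i are source-prompt embeddings (rows of `source_embs`),
    c_i* the corresponding target-prompt embeddings, and W the original
    `weight` of shape [out_dim, emb_dim].  Closed form:
        W' = (V C^T + lam W)(C C^T + lam I)^(-1),  C = source_embs^T, V = W target_embs^T.
    The lam*I term also keeps the (rank-deficient) normal matrix invertible.
    """
    W = weight                                    # [out_dim, emb_dim]
    C = source_embs.T                             # [emb_dim, n]
    V = W @ target_embs.T                         # desired outputs, [out_dim, n]
    reg = lam * torch.eye(W.shape[1], device=W.device, dtype=W.dtype)
    return (V @ C.T + lam * W) @ torch.linalg.inv(C @ C.T + reg)


@torch.no_grad()
def edit_located_layers(pipe, layer_indices, source_prompt, target_prompt, lam=0.1):
    """Apply the closed-form update to the to_k/to_v projections of the
    cross-attention layers found by the localization step (hypothetical glue;
    the position-by-position prompt pairing is a deliberate simplification)."""
    src = embed(pipe, source_prompt).squeeze(0).float()   # [seq, emb]
    tgt = embed(pipe, target_prompt).squeeze(0).float()   # [seq, emb]
    layers = cross_attention_layers(pipe.unet)
    for i in layer_indices:
        for proj in (layers[i].to_k, layers[i].to_v):     # nn.Linear modules
            new_w = closed_form_edit(proj.weight.float(), src, tgt, lam)
            proj.weight.copy_(new_w.to(proj.weight.dtype))
```

In a style-removal edit, `source_prompt` might be "a painting in the style of Van Gogh" and `target_prompt` simply "a painting", so that prompts mentioning the style are mapped to the keys and values of the style-free prompt; restricting the update to the located layers is what keeps the edit targeted rather than degrading unrelated generations.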