How to use and interpret activation patching

23 Apr 2024 | Stefan Heimersheim, Neel Nanda
Activation patching is a technique for interpreting the internal workings of neural networks by replacing internal activations with cached activations from a previous run. It is more targeted and controlled than ablation, which zeroes out activations. The technique is widely used in the literature under various names, including Causal Tracing, Resample Ablation, Interchange Intervention, and Causal Mediation Analysis.

The document provides practical advice and best practices for using activation patching, focusing on three main areas:

1. **What kind of patching experiments provide which evidence?** It discusses exploratory and confirmatory experiments, different levels of granularity for patching, and the use of path patching to separate direct from mediated interactions.
2. **How should you interpret activation patching results?** It explains the differences between denoising (clean → corrupt) and noising (corrupt → clean) patching, and how these methods can be used to identify necessary and sufficient components in a circuit.
3. **What metrics can you use, and what are common pitfalls?** It emphasizes the importance of choosing appropriate metrics and warns against common pitfalls such as discrete metrics, overly sharp metrics, and metrics sensitive to unintended information.

The document also includes a walkthrough of a hypothetical example to illustrate the concepts and provides recommendations for choosing corrupted prompts and interpreting patching results. It concludes with a summary of key points and acknowledges contributions from several researchers.
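As a rough illustration of the denoising and noising setups described above, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration and is not taken from the document: the toy two-layer network, its weights (`W_IN`, `W_OUT`), the two inputs standing in for a clean and a corrupted prompt, and the convention of rescaling the logit difference so that 0 is the corrupted baseline and 1 is the clean baseline.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy network: 2 inputs -> 3 ReLU hidden units -> 2 logits.
# The weights are invented purely for this illustration.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # one row per hidden unit
W_OUT = [[2.0, 0.0, 0.5], [0.0, 1.0, 0.5]]     # one row per output logit


def run(x: List[float],
        patch: Optional[Dict[int, float]] = None) -> Tuple[List[float], List[float]]:
    """Forward pass. `patch` maps a hidden-unit index to a cached activation
    that overwrites the value computed on this input."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_IN]
    for idx, value in (patch or {}).items():
        hidden[idx] = value
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W_OUT]
    return hidden, logits


def logit_diff(logits: List[float]) -> float:
    """Continuous metric: correct-answer logit minus wrong-answer logit."""
    return logits[0] - logits[1]


# Stand-ins for a clean and a corrupted prompt (assumed inputs).
clean_x, corrupt_x = [1.0, 0.0], [0.0, 1.0]

# Cache activations and baseline scores from both unpatched runs.
clean_hidden, clean_logits = run(clean_x)
corrupt_hidden, corrupt_logits = run(corrupt_x)
clean_score = logit_diff(clean_logits)
corrupt_score = logit_diff(corrupt_logits)


def normalized(score: float) -> float:
    """Rescale so 0 = corrupted baseline and 1 = clean baseline (assumed convention)."""
    return (score - corrupt_score) / (clean_score - corrupt_score)


# Denoising: run on the corrupted input, patch in one clean activation at a time.
denoising = [
    normalized(logit_diff(run(corrupt_x, {i: clean_hidden[i]})[1]))
    for i in range(len(W_IN))
]

# Noising: run on the clean input, patch in one corrupted activation at a time.
noising = [
    normalized(logit_diff(run(clean_x, {i: corrupt_hidden[i]})[1]))
    for i in range(len(W_IN))
]

for i in range(len(W_IN)):
    print(f"hidden unit {i}: denoising={denoising[i]:.2f}, noising={noising[i]:.2f}")
```

In this toy setup, hidden unit 2 takes the same value on both inputs, so patching it changes nothing in either direction, while units 0 and 1 each move the metric part of the way between the two baselines. A real experiment would patch model components such as attention heads or residual stream positions rather than single hidden units.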