[slides and audio] How to use and interpret activation patching

Activation patching is a mechanistic interpretability technique for understanding how neural networks process information. It replaces internal activations in one forward pass with the activations recorded on a different input and observes how the model's output changes. This is more targeted than zero ablation, which simply sets activations to zero: because the replacement activations come from a related prompt, patching can isolate the specific behaviors and circuits that distinguish the two prompts.

Activation patching can be applied in two directions: denoising and noising. Denoising runs the model on a corrupted prompt and patches in activations from the clean prompt to see whether they restore the model's behavior. Noising runs the model on the clean prompt and patches in activations from the corrupted prompt to see whether the behavior breaks or is maintained. Together, these methods help identify which components of the model are responsible for a specific task.

The choice of prompts is crucial. Patching only surfaces components whose activations differ between the clean and corrupted prompts, so the corrupted prompt determines which parts of the behavior an experiment can detect, and it must be chosen carefully to avoid unintended biases. Patching experiments can be exploratory or confirmatory: exploratory experiments aim to find circuits, while confirmatory experiments verify the function of circuits that have already been identified.

Interpreting patching results requires a careful choice of metric. Logit difference is a useful metric because it measures the model's confidence in the correct answer relative to an incorrect one and responds continuously to changes in the residual stream. Other metrics, such as log probabilities and probabilities, can be less effective because they are non-linear and can saturate. Common pitfalls include using discrete metrics such as accuracy, which register a change only when a threshold is crossed and so over-represent changes near that threshold, and overlooking the non-linear behavior of probabilities. It helps to check a range of metrics and to rely mainly on ones that are continuous and approximately linear in logit space.
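
To make the point about metrics concrete, here is a minimal sketch using a two-token vocabulary. It is only an illustration: the helper names `softmax` and `metrics` and the chosen logit values are assumptions made for this example, not taken from the talk. The correct token's logit is raised in equal steps, and each metric is recorded at every step.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def metrics(logits, correct, wrong):
    # Hypothetical helper: four candidate metrics for one set of logits.
    probs = softmax(logits)
    return {
        "logit_diff": float(logits[correct] - logits[wrong]),
        "prob": float(probs[correct]),
        "logprob": float(np.log(probs[correct])),
        "accuracy": float(np.argmax(logits) == correct),
    }

# Raise the correct token's logit in equal steps of 2 while the wrong token stays at 0.
steps = (-2.0, 0.0, 2.0, 4.0, 6.0)
rows = [metrics(np.array([x, 0.0]), correct=0, wrong=1) for x in steps]

for x, row in zip(steps, rows):
    print(f"correct logit {x:+.0f}: " + ", ".join(f"{k}={v:.3f}" for k, v in row.items()))
```

In this toy setting the logit difference moves by the same amount at every step, the probability barely moves once it is close to 1, and accuracy changes only at the single step where the correct token becomes the argmax.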
Patching results should be analyzed using a comprehensive approach, including multiple metrics and visualizations, to ensure a thorough understanding of the model's behavior.
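
The two patching directions described in the summary can be sketched on a toy model. Everything here is an assumption made for illustration: the model is a small random NumPy network with three sequential "components" writing to a residual stream, not a real transformer, and the names `run`, `logit_diff`, and `normalized` are hypothetical. The normalization (0 for the corrupted baseline, 1 for the clean baseline) is one possible convention, not necessarily the one the authors use.

```python
import numpy as np

rng = np.random.default_rng(0)
W_in = rng.normal(size=(4, 8))
W_comp = [rng.normal(size=(8, 8)) * 0.5 for _ in range(3)]  # three toy components
W_out = rng.normal(size=(8, 2))  # two output logits: index 0 is "correct", 1 is "wrong"

def run(x, patch=None):
    """Forward pass. `patch` maps component index -> activation to substitute."""
    patch = patch or {}
    resid = x @ W_in
    cache = {}
    for i, W in enumerate(W_comp):
        act = np.tanh(resid @ W)
        act = patch.get(i, act)  # overwrite this component's output if patched
        cache[i] = act
        resid = resid + act      # component writes into the residual stream
    return resid @ W_out, cache

def logit_diff(logits):
    return float(logits[0] - logits[1])

clean_x, corrupt_x = rng.normal(size=4), rng.normal(size=4)
clean_logits, clean_cache = run(clean_x)
corrupt_logits, corrupt_cache = run(corrupt_x)
clean_ld, corrupt_ld = logit_diff(clean_logits), logit_diff(corrupt_logits)

def normalized(ld):
    # Assumed convention: 0 = corrupted baseline, 1 = clean baseline.
    return (ld - corrupt_ld) / (clean_ld - corrupt_ld)

denoising_scores, noising_scores = [], []
for i in range(len(W_comp)):
    # Denoising: corrupted run, with component i taken from the clean run.
    logits, _ = run(corrupt_x, patch={i: clean_cache[i]})
    denoising_scores.append(normalized(logit_diff(logits)))
    # Noising: clean run, with component i taken from the corrupted run.
    logits, _ = run(clean_x, patch={i: corrupt_cache[i]})
    noising_scores.append(normalized(logit_diff(logits)))

print("denoising:", np.round(denoising_scores, 3))
print("noising:  ", np.round(noising_scores, 3))
```

Under this convention, a denoising score near 1 suggests that the patched component alone is enough to restore the clean behavior, while a noising score near 0 suggests that corrupting the component alone is enough to break it. With a real model the same loop would typically be written with the framework's hook mechanism instead of a hand-rolled `patch` argument.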

23 Apr 2024 | Stefan Heimersheim, Neel Nanda