AtP*: An efficient and scalable method for localizing LLM behaviour to components

2024-02-23 | János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda (Google DeepMind)
**Abstract:** Activation Patching is a method for computing causal attributions of model behavior to individual components. However, applying it exhaustively is computationally expensive, especially for state-of-the-art Large Language Models (LLMs). Attribution Patching (AtP) is a fast, gradient-based approximation to Activation Patching, but it has two classes of failure modes that lead to significant false negatives. We propose AtP*, a variant of AtP that addresses these failure modes while retaining scalability. We present the first systematic study of AtP and alternative methods, showing that AtP significantly outperforms all other methods, with AtP* providing further improvements. We also provide a method to bound the probability of remaining false negatives in AtP* estimates.

**Introduction:** As LLMs become more prevalent, understanding their internal mechanisms is a central goal of mechanistic interpretability. Attribution methods such as Activation Patching help identify which parts of the model contribute to specific behaviors. However, applying these methods exhaustively to large models is prohibitively expensive because of the sheer number of model components. AtP is a faster, approximate method that can be used as a prefiltering step to identify nodes with significant contributions, followed by more reliable verification of the surviving candidates with Activation Patching.

**Contributions:**
- We identify and address two common failure modes of AtP: attention saturation and cancellation between direct and indirect effects.
- We propose AtP*, which adds a correction for attention saturation and a technique called GradDrop to mitigate cancellation.
- We provide a diagnostic method to bound the probability of false negatives remaining in AtP* estimates.
- We compare AtP* with other methods and show its superior performance in terms of cost and accuracy.

**Methods:**
- **AtP* improvements:** We address attention saturation by explicitly recomputing the attention softmax for query and key nodes, keeping the linear approximation only for the remaining nodes (a minimal sketch of the basic AtP estimate and this correction follows this list).
- **GradDrop:** We modify backpropagation to disrupt cancellation by zeroing the gradients flowing through one downstream layer at a time and aggregating the resulting attribution estimates.
- **Diagnostics:** We use subset sampling to obtain upper confidence bounds on the effect magnitudes of nodes that AtP* might have missed (a simplified sketch appears at the end of this summary).
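To make the estimator concrete, here is a minimal sketch of the plain AtP estimate and of the attention-saturation correction for a query node, written against small NumPy arrays rather than the paper's actual codebase. All function and variable names (`atp_estimate`, `atp_star_query_estimate`, `grad_clean`, and so on) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def atp_estimate(clean_act, patch_act, grad_clean):
    """Plain AtP: first-order (gradient-based) estimate of the effect of
    patching a single node's activation from its clean value to its value
    on the corrupted prompt.

    clean_act  -- node activation on the clean prompt
    patch_act  -- node activation on the corrupted prompt
    grad_clean -- gradient of the metric (e.g. a logit difference) with
                  respect to this node's activation, on the clean run
    """
    return float(np.sum((patch_act - clean_act) * grad_clean))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def atp_star_query_estimate(q_clean, q_patch, keys, grad_attn_probs, scale):
    """Sketch of the attention-saturation correction for a query node:
    instead of linearizing through the attention softmax (which can badly
    underestimate effects when attention is saturated), recompute the
    attention probabilities with the patched query and apply the gradient
    only downstream of them."""
    probs_clean = softmax(q_clean @ keys.T / scale)
    probs_patch = softmax(q_patch @ keys.T / scale)
    return float(np.sum((probs_patch - probs_clean) * grad_attn_probs))

# Toy usage with random tensors, purely to show the shapes involved.
rng = np.random.default_rng(0)
d_model, d_head, seq = 16, 4, 8
clean_act, patch_act, grad_clean = (rng.normal(size=d_model) for _ in range(3))
print("AtP estimate:", atp_estimate(clean_act, patch_act, grad_clean))

q_clean, q_patch = rng.normal(size=d_head), rng.normal(size=d_head)
keys = rng.normal(size=(seq, d_head))
grad_attn_probs = rng.normal(size=seq)
print("AtP* query estimate:",
      atp_star_query_estimate(q_clean, q_patch, keys, grad_attn_probs,
                              scale=np.sqrt(d_head)))
```

GradDrop is not shown here because it needs access to model internals: it repeats the backward pass with the gradient through one downstream layer zeroed at a time and aggregates the resulting estimates, which breaks up cancellation between direct and indirect effects.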
**Experiments:**
- We evaluate AtP* across several models and prompt distributions, showing its effectiveness at identifying causally important nodes.
- We compare AtP* with alternative methods, including iterative activation patching and subsampling, and demonstrate its superior performance in terms of cost and accuracy.

**Discussion:**
- We discuss limitations and future directions, including the choice of nodes, applicability to other LLMs, and extensions to edge patching and coarser-grained nodes.
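To illustrate the subset-sampling diagnostic mentioned under Methods, here is a deliberately simplified sketch. It assumes a caller-supplied `patched_effect(subset)` that runs the model with every node in `subset` patched and returns the change in the metric, and it ignores cancellation within a subset, which the paper's actual diagnostic has to account for; the names and the bound below are illustrative, not the paper's exact procedure.

```python
import numpy as np

def subset_sampling_diagnostic(nodes, patched_effect, num_samples=200,
                               include_prob=0.3, theta=0.05, seed=0):
    """Patch random subsets of the not-yet-verified nodes and check whether
    any subset moves the metric by more than `theta`.

    If no sampled subset exceeds `theta`, then (under the optimistic
    assumption that a single node with effect above `theta` is not cancelled
    out by the rest of its subset) the chance that a given such node was
    simply never sampled is at most (1 - include_prob) ** num_samples.
    """
    rng = np.random.default_rng(seed)
    num_exceeding = 0
    for _ in range(num_samples):
        subset = [n for n in nodes if rng.random() < include_prob]
        if abs(patched_effect(subset)) > theta:
            num_exceeding += 1
    miss_probability_bound = (1.0 - include_prob) ** num_samples
    return num_exceeding, miss_probability_bound

# Toy usage: pretend each node has an additive effect we can sum directly.
true_effects = {f"node_{i}": e for i, e in
                enumerate(np.random.default_rng(1).normal(scale=0.01, size=50))}
exceed, bound = subset_sampling_diagnostic(
    list(true_effects), lambda s: sum(true_effects[n] for n in s))
print(f"subsets exceeding theta: {exceed}, per-node miss bound: {bound:.2e}")
```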