AtP*: An efficient and scalable method for localizing LLM behaviour to components


2024-02-23 | János Kramár, Tom Lieberum, Rohin Shah and Neel Nanda
AtP* is an efficient and scalable method for localizing LLM behaviour to components. Activation Patching computes causal attributions of behaviour to model components, but is computationally expensive for large models. AtP is a faster, approximate method that can be used as a prefiltering step, followed by verification to filter out false positives. AtP* improves upon AtP by addressing two failure modes that produce brittle false negatives: attention saturation, which AtP* handles by recomputing the attention softmax, and cancellation between direct and indirect effects, which it handles with dropout on the backward pass. It also provides a method to bound the probability of remaining false negatives.

The paper presents a systematic study of AtP and alternative methods for faster activation patching, showing that AtP significantly outperforms the other methods, with AtP* providing further improvement. It also introduces a diagnostic method to estimate the residual error of AtP* and statistically bound the sizes of any remaining false negatives. The paper evaluates AtP* across model sizes and distributions, showing that it is effective in a range of settings, and compares it with alternative methods such as subsampling and hierarchical grouping, showing that AtP* is more efficient and accurate. The paper concludes that AtP* is a promising method for localizing LLM behaviour to components, with potential applications in various domains.
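To make the approximation concrete, the following is a minimal sketch of the idea behind attribution patching (the gradient-based approximation that AtP builds on), contrasted with exact activation patching, on a toy two-layer MLP. It is an illustration only, not the paper's implementation (which operates on attention and MLP nodes of an LLM and adds the AtP* corrections); the toy model, the scalar metric, and the names clean_x / noise_x are assumptions made for this example.

```python
# Sketch: attribution patching (AtP-style first-order estimate) vs. exact
# activation patching on a toy 2-layer MLP. Illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
clean_x = torch.randn(1, 8)   # stand-in for the "clean" prompt
noise_x = torch.randn(1, 8)   # stand-in for the corrupted / "noise" prompt

def metric(out):
    # Scalar behaviour metric; in the LLM setting this would be e.g. a logit difference.
    return out.sum()

def run_with_cache(x, need_grad=False):
    """Forward pass caching the post-ReLU activation (our 'node'),
    and optionally the gradient of the metric with respect to it."""
    cache = {}

    def hook(_, __, output):
        output.retain_grad()
        cache["node"] = output

    handle = model[1].register_forward_hook(hook)
    m = metric(model(x))
    handle.remove()
    if need_grad:
        m.backward()
        cache["grad"] = cache["node"].grad
    return m.item(), cache

# One clean pass (plus a backward pass) and one noise pass give estimates for all nodes.
clean_metric, clean = run_with_cache(clean_x, need_grad=True)
_, noise = run_with_cache(noise_x)

# AtP-style estimate per neuron: (noise activation - clean activation) * d(metric)/d(activation).
delta = (noise["node"] - clean["node"]).detach()
atp_estimate = (delta * clean["grad"]).squeeze()

# Ground truth via activation patching: one extra forward pass per patched neuron.
def activation_patch_effect(i):
    def hook(_, __, output):
        patched = output.clone()
        patched[:, i] = noise["node"][:, i].detach()
        return patched

    handle = model[1].register_forward_hook(hook)
    with torch.no_grad():
        patched_metric = metric(model(clean_x)).item()
    handle.remove()
    return patched_metric - clean_metric

true_effects = torch.tensor([activation_patch_effect(i) for i in range(16)])
print("AtP estimates:      ", atp_estimate)
print("True patch effects: ", true_effects)
```

The contrast shows where the cost saving comes from: the gradient-based estimate covers every node with two forward passes and one backward pass, whereas exact activation patching needs a separate patched forward pass per node, which is what becomes prohibitive for large models.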