25 Jun 2024 | Connor Kissane*, Robert Krzyzanowski*, Joseph Bloom, Arthur Conmy, Neel Nanda
This paper introduces Attention Output Sparse Autoencoders (SAEs) to decompose attention layer outputs into sparse, interpretable features. The authors train SAEs on attention outputs from various transformer models, including GPT-2 Small, and demonstrate that these SAEs can identify meaningful feature families such as long-range context, short-range context, and induction features. They also show that attention heads are often polysemantic, performing multiple unrelated tasks. By using SAEs, the authors gain deeper insights into how attention layers function, including the role of different heads in tasks like induction. They also apply SAEs to analyze the Indirect Object Identification circuit, revealing causally relevant features. The authors introduce Recursive Direct Feature Attribution (RDFA) to trace model computations on arbitrary prompts and release a visualization tool for exploring attention outputs through SAEs. The results show that SAEs provide sparse, faithful, and interpretable reconstructions of attention outputs, enabling detailed analysis of model behavior and circuit functionality. The work highlights the value of Attention Output SAEs as a general-purpose tool for mechanistic interpretability.
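To make the core idea concrete, here is a minimal sketch (not the authors' released code) of an Attention Output SAE in PyTorch: a single hidden layer with a ReLU encoder and linear decoder, trained to reconstruct an attention layer's output vectors under an L1 sparsity penalty. The class name, loss function, and the `l1_coeff` hyperparameter are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnOutputSAE(nn.Module):
    """Sketch of a sparse autoencoder over attention layer outputs."""

    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_hidden) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = nn.Parameter(torch.randn(d_hidden, d_in) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, x: torch.Tensor):
        # Encode: non-negative feature activations; sparsity comes from the L1 term in the loss.
        acts = F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode: reconstruct the attention output as a sparse combination of learned feature directions.
        recon = acts @ self.W_dec + self.b_dec
        return recon, acts

def sae_loss(x: torch.Tensor, recon: torch.Tensor, acts: torch.Tensor, l1_coeff: float = 1e-3):
    # Reconstruction fidelity plus an L1 penalty encouraging each input to activate few features.
    return F.mse_loss(recon, x) + l1_coeff * acts.abs().sum(dim=-1).mean()

# Usage sketch: x would be attention output activations collected from a model such as GPT-2 Small.
x = torch.randn(64, 768)           # batch of attention output vectors (d_model = 768 for GPT-2 Small)
sae = AttnOutputSAE(d_in=768, d_hidden=768 * 16)
recon, acts = sae(x)
loss = sae_loss(x, recon, acts)
```

The decoder rows can then be read as candidate feature directions, and the encoder activations as how strongly each feature fires on a given token, which is the kind of decomposition the feature-family and circuit analyses in the paper rely on.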