Interpreting Attention Layer Outputs with Sparse Autoencoders


25 Jun 2024 | Connor Kissane, Robert Krzyzanowski, Joseph Bloom, Arthur Conmy, Neel Nanda
This paper introduces Sparse Autoencoders (SAEs) as a tool for interpreting the outputs of attention layers in transformers, decomposing these high-dimensional activations into interpretable features. The authors train SAEs on attention layer outputs and show that the resulting reconstructions are both sparse and interpretable. A qualitative study of the features computed by attention layers identifies several families, including long-range context, short-range context, and induction features. The paper also examines the polysemantic nature of attention heads in GPT-2 Small, finding that at least 90% of the heads have multiple unrelated roles. Additionally, the authors use SAEs to investigate the mystery of seemingly redundant induction heads and confirm the hypothesis that some specialize in "long prefix induction" while others specialize in "short prefix induction." They further apply SAEs to the Indirect Object Identification (IOI) circuit, revealing causally relevant intermediate variables and deepening our understanding of the circuit's semantics. The paper concludes by introducing Recursive Direct Feature Attribution (RDFA) and releasing an interactive tool for exploring attention output SAEs.
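To make the core setup concrete, here is a minimal sketch (not the authors' released code) of the kind of SAE trained in this line of work: a single-hidden-layer autoencoder with a ReLU encoder and an L1 sparsity penalty, fit to cached attention layer outputs. All names, shapes, and hyperparameters below (the `d_sae` expansion factor, the `l1` coefficient) are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class AttnOutputSAE(nn.Module):
    """Sketch of a sparse autoencoder over attention layer outputs.

    x is the d_model-dimensional attention output at one token position;
    the SAE reconstructs x as a sparse combination of learned features.
    """
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # Encoder: sparse, non-negative feature activations.
        feats = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decoder: reconstruct the attention output from active features.
        x_hat = feats @ self.W_dec + self.b_dec
        return x_hat, feats

# Illustrative training step on cached activations (random stand-ins here).
sae = AttnOutputSAE(d_model=768, d_sae=768 * 32)  # GPT-2 Small width; 32x expansion is an assumption
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
x = torch.randn(4096, 768)                        # batch of attention outputs
x_hat, feats = sae(x)
recon_loss = (x - x_hat).pow(2).sum(-1).mean()    # reconstruction error
l1_loss = feats.abs().sum(-1).mean()              # sparsity penalty
(recon_loss + 3e-4 * l1_loss).backward()
opt.step()
```

In practice the activations would come from running the model over a large text corpus and caching each attention layer's output at every token position, rather than from random tensors.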
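The attribution idea underlying RDFA can also be sketched. Because the attention layer's output is a linear sum of per-head contributions, each SAE feature's pre-ReLU activation decomposes additively across heads (and, one step further, across source positions). The helper below is a hypothetical illustration of that single decomposition step, not the paper's full recursive algorithm; `head_outs` is assumed to hold each head's contribution after the output projection.

```python
def feature_attribution_per_head(head_outs: torch.Tensor,
                                 sae: AttnOutputSAE,
                                 feature_idx: int) -> torch.Tensor:
    """head_outs: [n_heads, d_model], per-head terms whose sum is the
    attention layer output at one position. Returns each head's additive
    contribution to the chosen feature's pre-ReLU activation (bias terms
    are shared constants and omitted here)."""
    enc_dir = sae.W_enc[:, feature_idx]  # encoder direction for this feature
    return head_outs @ enc_dir           # [n_heads] contributions
```

Applying this decomposition repeatedly, upstream through the components that feed each head, is what makes the attribution "recursive."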