13 Oct 2024 | Javier Ferrando, Gabriele Sarti, Arianna Bisazza, Marta R. Costa-jussà
This primer provides a concise technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models (LMs), focusing on the generative decoder-only architecture. It presents a comprehensive overview of the known internal mechanisms implemented by these models, uncovering connections across popular approaches and active research directions in this area. The paper introduces the components of a Transformer LM, including the Transformer layer, layer normalization, the attention block, and the feedforward network block. It then discusses the prediction head and Transformer decompositions, highlighting the role of the unembedding matrix in projecting the last residual stream state into logits over the next token. The paper also covers behavior localization, including input attribution, model component attribution, and causal interventions, as well as information decoding techniques such as probing and the linear representation hypothesis. The work emphasizes the importance of understanding the inner workings of LMs for ensuring safety and fairness and for guiding model improvements, and it closes by discussing the limitations of current interpretability methods and proposing new approaches for more accurate and efficient analysis of model behavior.
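The unembedding step mentioned above is just a linear projection followed by a softmax. A minimal sketch with toy dimensions and made-up weights (the names `unembed` and `W_U` and the specific numbers are illustrative, not code from the paper; the formulation is the standard logits-then-softmax one):

```python
import math

def unembed(x, W_U):
    """Project a residual stream state x (size d) through the
    unembedding matrix W_U (d x vocab): one logit per vocabulary token."""
    d, vocab = len(W_U), len(W_U[0])
    return [sum(x[i] * W_U[i][j] for i in range(d)) for j in range(vocab)]

def softmax(logits):
    """Turn logits into a next-token probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: d_model = 2, vocabulary of 3 tokens.
x = [1.0, -0.5]                  # final residual stream state
W_U = [[2.0, 0.0, -1.0],         # unembedding matrix (2 x 3)
       [0.0, 1.0, 1.0]]
probs = softmax(unembed(x, W_U))
```

In a real decoder-only LM the same projection is applied to the residual stream state at the final position (and, in "logit lens"-style analyses, to intermediate states as well).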
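Probing, one of the information decoding techniques listed above, fits a simple classifier on hidden states to test whether a property is linearly decodable from them. A self-contained sketch on synthetic "activations" (the data, the training loop, and the encoding direction are illustrative assumptions, not material from the paper):

```python
import math
import random

def train_linear_probe(states, labels, lr=0.5, steps=200):
    """Fit a logistic-regression probe (w, b) on hidden states:
    can a linear readout recover the binary property?"""
    d = len(states[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        for x, y in zip(states, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Synthetic "hidden states": the property is encoded along the
# direction [1, -1], as the linear representation hypothesis
# would predict for a linearly represented feature.
random.seed(0)
states = [[random.gauss(1, 0.3), random.gauss(-1, 0.3)] for _ in range(20)]
states += [[random.gauss(-1, 0.3), random.gauss(1, 0.3)] for _ in range(20)]
labels = [1] * 20 + [0] * 20

w, b = train_linear_probe(states, labels)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
         for x in states]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
```

High probe accuracy alone does not establish that the model *uses* the property, which is one reason the paper pairs decoding methods like this with causal interventions.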