2 Jul 2024 | Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao
This paper provides a comprehensive review of mechanistic interpretability (MI) in the context of transformer-based language models (LMs). MI aims to understand LMs by reverse-engineering their internal computations, offering insights into the functions of LM components and enabling better utilization of these models. The paper outlines fundamental objects of study in MI, including features, circuits, and universality, and reviews techniques used for MI analysis, such as logit lens, probing, sparse autoencoder (SAE), visualization, automated feature explanation, knockout/ablation, and causal mediation analysis (CMA). It also discusses evaluation methods, focusing on faithfulness, completeness, and minimality. The paper presents a beginner's roadmap for navigating the field and highlights significant findings and applications, such as monosemantics vs. polysemantics, superposition, in-context learning, reasoning, and learning dynamics. Additionally, it explores the practical utility of MI in model enhancement, AI safety, and other downstream tasks. Finally, the paper identifies current gaps and potential future directions, emphasizing the need for automated hypothesis generation, studies on complex tasks and large LMs, standardized benchmarks, and further research on encoder-only and encoder-decoder LMs.