Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT


19 Feb 2024 | Zhengfu He, Xuyang Ge, Qiong Tang, Tianxiang Sun, Qinyuan Cheng, Xipeng Qiu
This paper introduces a framework for discovering interpretable circuits in Transformers using sparse dictionary learning, with a case study on the Othello-GPT model. The authors address the challenge of superposition, which makes it difficult to extract human-understandable features from model activations. By decomposing activations into more monosemantic features, the framework identifies circuits that connect these features, providing a more detailed understanding of the model's internal computations.

The key contributions of the paper are:

1. **Framework overview**: The framework decomposes dictionary features from all modules writing to the residual stream, including the embedding, attention outputs, and MLP outputs. It traces back from any logit or dictionary feature to lower-level features, computing their contributions to interpretable, local model behaviors.
2. **Dictionary learning**: The authors train dictionaries on the outputs of the attention and MLP layers of a model trained on a synthetic task, Othello, to find interpretable features. They argue that decomposing the output of each module writing to the residual stream is more effective than alternative decomposition sites.
3. **Circuit discovery**: The framework identifies meaningful subgraphs in the computational graph of the Othello model, revealing a large portion of the internal information flow. This is achieved without activation patching, which avoids the out-of-distribution problem of patched inputs and improves asymptotic complexity.
4. **Case study on Othello-GPT**: The authors apply their framework to Othello-GPT, a 1.2M-parameter decoder-only Transformer trained on a synthetic game-move prediction task. They discover various interpretable circuits, including those related to board state, empty cells, and legal moves.

The paper also discusses related work in mechanistic interpretability, superposition, and circuit discovery, and compares the proposed method to existing patch-based methods.
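The core mechanism behind points 1 and 3, decomposing a residual-stream activation into sparse dictionary features and reading off each feature's direct contribution to a logit without any patching, can be sketched as follows. This is a minimal NumPy illustration with made-up dimensions and random weights, not the paper's implementation; `W_enc`, `W_dec`, and `W_U` here are hypothetical stand-ins for a trained sparse dictionary and an unembedding direction.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_feats = 8, 32  # toy sizes; real dictionaries are much wider

# Randomly initialized encoder/decoder weights -- stand-ins for a trained
# sparse dictionary, not parameters from the paper.
W_enc = rng.normal(size=(d_model, n_feats)) / np.sqrt(d_model)
b_enc = np.zeros(n_feats)
W_dec = rng.normal(size=(n_feats, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm dictionary rows

def encode(x):
    """Sparse feature activations: ReLU(x @ W_enc + b_enc)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    """Reconstruct the residual-stream activation from feature activations."""
    return f @ W_dec

def sae_loss(x, l1_coef=1e-3):
    """Dictionary-learning objective: reconstruction error plus L1 sparsity."""
    f = encode(x)
    recon = np.mean((x - decode(f)) ** 2)
    return recon + l1_coef * np.abs(f).sum()

# Patch-free attribution: because features combine linearly in the residual
# stream, feature i's direct contribution to a logit is its activation times
# the dot product of its dictionary vector with the unembedding direction.
W_U = rng.normal(size=(d_model,))   # hypothetical unembedding column for one logit
x = rng.normal(size=(d_model,))     # a residual-stream activation
f = encode(x)
contributions = f * (W_dec @ W_U)   # per-feature direct effect on the logit

# Sanity check: the contributions sum to the reconstructed activation's logit.
assert np.isclose(contributions.sum(), decode(f) @ W_U)
```

Because the decomposition is linear, the per-feature contributions sum exactly to the reconstructed logit, which is why no counterfactual (patched) forward pass is needed.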
The authors conclude by highlighting the potential for further research in Transformer pathology and scalable circuit analysis.