Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models


2024-03-31 | Samuel Marks*, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller*
This paper introduces methods for discovering and applying sparse feature circuits: causally implicated subnetworks of human-interpretable features that explain language model behaviors. Unlike prior work that builds circuits from polysemantic and difficult-to-interpret units such as attention heads or neurons, sparse feature circuits are built from fine-grained, human-interpretable units, which makes them suitable for understanding unanticipated mechanisms in detail and for downstream applications. The authors introduce SHIFT, a technique that improves a classifier's generalization by ablating features judged task-irrelevant. They also demonstrate an entirely unsupervised and scalable interpretability pipeline by discovering thousands of sparse feature circuits for automatically discovered model behaviors.

The method uses sparse autoencoders (SAEs) to identify interpretable directions in a language model's latent space, then applies linear approximations to efficiently estimate which SAE features are most causally implicated in a model behavior, along with the connections between them. The result is a sparse feature circuit that explains how the behavior arises from interactions among fine-grained, human-interpretable units.
The authors evaluate their method on subject-verb agreement tasks and find that sparse feature circuits are more interpretable and concise than circuits built from neurons. They also show that the method can remove a classifier's sensitivity to unintended signals without requiring disambiguating data: in a case study, a classifier is debiased in a worst-case setting where the unintended signal (gender) is perfectly predictive of the target labels (profession); a minimal sketch of this ablation step appears below.

Finally, the paper presents an unsupervised circuit discovery approach that automatically identifies thousands of LM behaviors and their corresponding feature circuits, highlighting the scalability of the method and its potential for future research. Code, data, and autoencoders are released at github.com/saprmarks/feature-circuits.