Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models


2024-03-31 | Samuel Marks*, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller*
This paper introduces methods for discovering and applying sparse feature circuits: causally implicated subnetworks of human-interpretable features that explain language model behaviors. Unlike prior work that builds circuits from polysemantic and difficult-to-interpret units such as attention heads or neurons, sparse feature circuits are built from fine-grained, human-interpretable units, which makes them suitable for understanding unanticipated mechanisms in detail and for downstream applications. The authors introduce SHIFT, a technique that improves a classifier's generalization by ablating features judged task-irrelevant. They also demonstrate an entirely unsupervised and scalable interpretability pipeline by discovering thousands of sparse feature circuits for automatically discovered model behaviors.

The method uses sparse autoencoders (SAEs) to identify interpretable directions in a language model's latent space, then applies linear approximations to efficiently estimate which SAE features are most causally implicated in a model behavior, along with the connections between them. The result is a sparse feature circuit that explains how the behavior arises from interactions among fine-grained, human-interpretable units.
The authors evaluate their method on subject-verb agreement tasks and find that sparse feature circuits are more interpretable and concise than circuits built from neurons. They also show that the method can remove a classifier's sensitivity to unintended signals without requiring disambiguating data: in a case study, a classifier is debiased in a worst-case setting where the unintended signal (gender) is perfectly predictive of the target labels (profession); a minimal sketch of this ablation step appears below.

Finally, the paper presents an unsupervised circuit discovery approach that automatically identifies thousands of LM behaviors and their corresponding feature circuits, highlighting the scalability of the method and its potential for future research. Code, data, and autoencoders are released at github.com/saprmarks/feature-circuits.