Transcoders Find Interpretable LLM Feature Circuits

17 Jun 2024 | Jacob Dunefsky, Philippe Chlenski, Neel Nanda
The paper introduces transcoders as a method to approximate dense MLP sublayers with wider, sparsely-activating MLP layers, addressing the challenge of fine-grained circuit analysis in transformer-based language models. Transcoders are trained to faithfully approximate the output of an original MLP sublayer while encouraging sparse activations. The authors demonstrate that transcoders perform at least as well as sparse autoencoders (SAEs) in terms of sparsity, faithfulness, and human-interpretability. They also introduce a novel method for using transcoders to perform weights-based circuit analysis, which cleanly factorizes circuits into input-dependent and input-invariant terms. The method is applied to various tasks, including blind case studies and an in-depth analysis of the "greater-than" circuit in GPT2-small. The results suggest that transcoders can effectively decompose model computations involving MLPs into interpretable circuits. The code for training transcoders and conducting experiments is available on GitHub.
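To make the setup concrete, here is a minimal sketch of a transcoder and its training objective, assuming a PyTorch setting: a wide encoder-ReLU-decoder MLP is trained to reproduce the original MLP sublayer's output (MSE term) while an L1 penalty on the feature activations encourages sparsity. The dimension names, hyperparameters, and helper names below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Wide, sparsely-activating MLP trained to mimic an original MLP sublayer."""
    def __init__(self, d_model: int, d_transcoder: int):
        super().__init__()
        # Encoder and decoder weights are input-invariant; only the feature
        # activations below depend on the input.
        self.encoder = nn.Linear(d_model, d_transcoder)
        self.decoder = nn.Linear(d_transcoder, d_model)

    def forward(self, mlp_input: torch.Tensor):
        # ReLU feature activations; the L1 penalty during training pushes most
        # of them to zero, yielding sparse, interpretable features.
        feature_acts = torch.relu(self.encoder(mlp_input))
        reconstruction = self.decoder(feature_acts)
        return reconstruction, feature_acts

def transcoder_loss(reconstruction: torch.Tensor,
                    mlp_output: torch.Tensor,
                    feature_acts: torch.Tensor,
                    l1_coeff: float = 1e-3) -> torch.Tensor:
    # Faithfulness: match the original MLP sublayer's output.
    mse = (reconstruction - mlp_output).pow(2).mean()
    # Sparsity: penalize the total feature activation per token.
    sparsity = feature_acts.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity
```

Because the encoder and decoder weights do not depend on the input, a circuit analysis based on them factorizes naturally: the weights supply the input-invariant part of a connection, and the (sparse) feature activations supply the input-dependent part, which is the factorization the summary above refers to.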