Transcoders Find Interpretable LLM Feature Circuits


17 Jun 2024 | Jacob Dunefsky, Philippe Chlenski, Neel Nanda
Transcoders approximate the MLP sublayers of transformer-based language models, enabling more interpretable circuit analysis. This paper introduces transcoders: wide, sparsely activating MLP layers trained to faithfully approximate the output of the original MLP sublayers. An L1 regularization penalty on their activations encourages sparsity, keeping the learned features interpretable while remaining faithful to the original computation. Compared against sparse autoencoders (SAEs), transcoders perform at least as well in terms of sparsity, faithfulness, and human interpretability.
Transcoders also enable a novel method for weights-based circuit analysis through MLP sublayers, one that cleanly factorizes circuits into input-dependent and input-invariant terms. Applying transcoders to reverse-engineer unknown circuits in the model yields novel insights into the "greater-than" circuit in GPT2-small. The results suggest that transcoders can effectively decompose model computations involving MLPs into interpretable circuits. Code for training transcoders and reproducing the experiments is available at https://github.com/jacobdunefsky/transcoder_circuits.
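The architecture and training objective described above can be sketched as follows. This is a minimal NumPy sketch, not the paper's released code: the parameter names, initialization scale, and L1 coefficient are illustrative assumptions.

```python
import numpy as np

def init_transcoder(d_model, d_hidden, rng):
    # Illustrative parameters: an encoder into a wide hidden layer and a
    # decoder back to the model dimension (names are assumptions, not the
    # paper's released code).
    return {
        "W_enc": rng.normal(scale=0.02, size=(d_model, d_hidden)),
        "b_enc": np.zeros(d_hidden),
        "W_dec": rng.normal(scale=0.02, size=(d_hidden, d_model)),
        "b_dec": np.zeros(d_model),
    }

def transcoder_forward(params, x):
    # Sparse feature activations: ReLU of the encoder pre-activations.
    z = np.maximum(0.0, x @ params["W_enc"] + params["b_enc"])
    # Decode features back into the original MLP sublayer's output space.
    out = z @ params["W_dec"] + params["b_dec"]
    return out, z

def transcoder_loss(out, target, z, l1_coeff=1e-3):
    # Faithfulness term: match the original MLP sublayer's output.
    mse = np.mean((out - target) ** 2)
    # Sparsity term: L1 penalty on the feature activations.
    l1 = np.mean(np.abs(z).sum(axis=-1))
    return mse + l1_coeff * l1
```

In training, `target` would be the original MLP sublayer's output on the same input; the L1 term drives most feature activations to exactly zero, which is what makes the learned features sparse and individually inspectable.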
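The input-dependent/input-invariant factorization can be illustrated with two hypothetical transcoders at different layers. In this sketch (all names and dimensions are assumptions, and attention and LayerNorm between the layers are ignored), the attribution of an early feature to a later feature's pre-activation splits into an input-dependent scalar, the early feature's activation, times an input-invariant weight product:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_hid = 16, 64

# Hypothetical transcoder weights at two layers (illustrative only).
W_dec_early = rng.normal(scale=0.1, size=(d_hid, d_model))  # early-layer decoder
W_enc_late = rng.normal(scale=0.1, size=(d_model, d_hid))   # late-layer encoder

# Input-INVARIANT part: how strongly early feature i writes into the
# pre-activation of late feature j via the residual stream. Computed
# purely from weights, before seeing any input.
connection = W_dec_early @ W_enc_late  # shape (d_hid, d_hid)

# Input-DEPENDENT part: the early features' activations on one input.
z_early = np.maximum(0.0, rng.normal(size=d_hid))

# Attribution of each early feature to late feature j's pre-activation.
j = 3
attributions = z_early * connection[:, j]

# Sanity check: the attributions sum to the component of feature j's
# pre-activation contributed by the early transcoder's output (bias
# terms, attention, and LayerNorm omitted for simplicity).
pre_act_from_early = (z_early @ W_dec_early) @ W_enc_late[:, j]
assert np.isclose(attributions.sum(), pre_act_from_early)
```

Because `connection` depends only on weights, circuit structure can be read off once for all inputs; only the per-input activations `z_early` change, which is what makes weights-based circuit analysis through MLP sublayers tractable.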