Improving Dictionary Learning with Gated Sparse Autoencoders


2024-05-01 | Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah and Neel Nanda
This paper introduces Gated Sparse Autoencoders (Gated SAEs), an improvement over standard sparse autoencoders (SAEs) for discovering interpretable features in language models (LMs). Gated SAEs separate the task of determining which features are active from estimating their magnitudes, so the L1 sparsity penalty can be applied only to the gating path; this reduces undesirable side effects such as shrinkage, the systematic underestimation of feature activations.
Training Gated SAEs on models up to 7B parameters shows that they resolve shrinkage, are comparably interpretable, and need roughly half as many firing features to reach the same reconstruction fidelity. Evaluations across multiple models and multiple sites within those models find Gated SAEs to be a Pareto improvement over baseline SAEs on the sparsity and reconstruction fidelity trade-off, and a double-blind study suggests Gated SAE features are as interpretable as baseline SAE features. The key contributions are: introducing the Gated SAE architecture, demonstrating the improved sparsity and reconstruction fidelity trade-off, overcoming shrinkage, and providing evidence that Gated SAE features are interpretable. The paper also discusses the limitations of SAEs and the potential of Gated SAEs to improve dictionary learning in LLMs. The results suggest that Gated SAEs can enhance work on interpreting language models, understanding their components, and steering their behavior.
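The core idea, splitting the encoder into a gating path (which features fire) and a magnitude path (how strongly they fire), can be sketched as a minimal NumPy forward pass. This is an illustrative reconstruction from the summary above, not the authors' code: the function and parameter names (`gated_sae_forward`, `W_enc`, `b_gate`, `b_mag`, `r_mag`, `W_dec`, `b_dec`) are assumptions, and the encoder weight sharing via a per-feature rescaling `exp(r_mag)` follows the paper's described parameterization.

```python
import numpy as np

def gated_sae_forward(x, W_enc, b_gate, b_mag, r_mag, W_dec, b_dec):
    """Sketch of a Gated SAE forward pass (names are illustrative).

    x:      (batch, d_model) input activations
    W_enc:  (d_model, n_features) shared encoder weights
    r_mag:  (n_features,) per-feature log-scale tying the magnitude
            path's weights to the gating path's weights
    """
    # Shared pre-activation, centered by the decoder bias
    pre = (x - b_dec) @ W_enc

    # Gating path: a binary decision about WHICH features are active.
    # Only this path receives the L1 sparsity penalty during training.
    gate = (pre + b_gate) > 0

    # Magnitude path: estimates HOW STRONGLY active features fire,
    # free of the L1 penalty, which avoids shrinkage of activations.
    mag = np.maximum(pre * np.exp(r_mag) + b_mag, 0.0)

    # Feature activations: magnitudes masked by the gate
    f = gate * mag

    # Linear decoder reconstructs the input from active features
    x_hat = f @ W_dec + b_dec
    return f, x_hat
```

Note that the Heaviside gate is not differentiable, so the paper trains the gating path with an auxiliary loss rather than backpropagating through the step function; this sketch shows only inference.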