Improving Dictionary Learning with Gated Sparse Autoencoders

2024-05-01 | Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum†, Vikrant Varma†, János Kramár, Rohin Shah and Neel Nanda
The paper introduces the Gated Sparse Autoencoder (Gated SAE), a modification to the standard Sparse Autoencoder (SAE) architecture designed to improve the trade-off between sparsity and reconstruction fidelity when reconstructing language model activations. The key insight of Gated SAEs is to separate the task of determining which dictionary directions to use from the task of estimating the magnitudes of those directions, applying the L1 penalty only to the former so as to limit the scope of its undesirable side effects. This addresses shrinkage, the systematic underestimation of feature activations caused by the L1 penalty in standard SAEs. Through experiments on large language models, the authors demonstrate that Gated SAEs achieve a Pareto improvement over baseline SAEs in reconstruction quality versus sparsity, requiring half as many firing features to reach comparable reconstruction fidelity. They also show that Gated SAEs are comparable in interpretability to baseline SAEs, although definitive conclusions are not drawn given the limitations of current interpretability metrics. The paper includes an ablation study validating the key components of the Gated SAE methodology and discusses related work in mechanistic interpretability and dictionary learning.
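To make the architectural idea concrete, below is a minimal sketch of a gated SAE forward pass in PyTorch, following the separation described above: a gating path that decides which features fire (and receives the L1 penalty) and a magnitude path that estimates how strongly they fire. The parameter names, initialisation, and the loss coefficient in the usage example are illustrative assumptions rather than the authors' reference implementation, and the paper's auxiliary reconstruction loss (which trains the gating path through a frozen copy of the decoder) is omitted for brevity.

```python
# Minimal sketch of a Gated SAE forward pass (PyTorch; names and shapes are
# illustrative assumptions, not the authors' reference code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedSAE(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # Encoder/decoder weights shared between the two paths.
        self.W_enc = nn.Parameter(torch.randn(d_model, d_dict) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_dict, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # Gating path: decides WHICH features are active (gets the L1 penalty).
        self.b_gate = nn.Parameter(torch.zeros(d_dict))
        # Magnitude path: estimates HOW STRONGLY active features fire
        # (no L1 penalty, so magnitude estimates are not shrunk toward zero).
        self.r_mag = nn.Parameter(torch.zeros(d_dict))  # per-feature rescaling of W_enc
        self.b_mag = nn.Parameter(torch.zeros(d_dict))

    def forward(self, x: torch.Tensor):
        x_centered = x - self.b_dec
        pi_gate = x_centered @ self.W_enc + self.b_gate
        gate = (pi_gate > 0).float()  # binary mask: which features fire
        mag = F.relu(x_centered @ (self.W_enc * torch.exp(self.r_mag)) + self.b_mag)
        f = gate * mag                # gated feature activations
        x_hat = f @ self.W_dec + self.b_dec
        # Sparsity penalty applies to the gating pre-activations only,
        # so shrinkage does not bias the magnitude estimates.
        l1 = F.relu(pi_gate).sum(dim=-1).mean()
        recon = (x - x_hat).pow(2).sum(dim=-1).mean()
        return x_hat, f, recon, l1


# Usage: reconstruct a batch of (hypothetical) residual-stream activations.
sae = GatedSAE(d_model=512, d_dict=4096)
x = torch.randn(8, 512)
x_hat, f, recon, l1 = sae(x)
loss = recon + 3e-4 * l1  # the sparsity coefficient is an illustrative value
```

Because the binary gate is not differentiable, the gating parameters in the paper are trained through the L1 term and the omitted auxiliary loss rather than through the reconstruction term; the sketch above only illustrates the forward computation.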