24 May 2024 | Dan Braun*, Jordan Taylor†, Nicholas Goldowsky-Dill*, Lee Sharkey*
The paper introduces a novel method called end-to-end (e2e) sparse dictionary learning to identify functionally important features in neural networks. Traditional sparse autoencoders (SAEs) learn a sparse, overcomplete dictionary to reconstruct a network's internal activations, but because they are trained purely on reconstruction error, they may capture structure in the dataset rather than the computational structure of the network itself. To address this, the authors propose e2e SAEs, which are trained to minimize the KL divergence between the output distribution of the original model and that of the model with the SAE's reconstructed activations spliced in at a given layer. This objective ensures that the learned features are functionally important to the network's performance.
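The training objective described above can be sketched as follows. This is an illustrative numpy sketch, not the authors' implementation: the function names, argument shapes, and the simple L1 sparsity penalty are assumptions for exposition. It combines the KL term between the original and SAE-spliced output distributions with a sparsity penalty on the SAE's latent activations.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def e2e_loss(orig_logits, sae_logits, sae_latents, sparsity_coeff=1.0):
    """Sketch of the e2e SAE loss: KL(original || spliced) + L1 sparsity.

    orig_logits : logits of the unmodified model, shape (tokens, vocab)
    sae_logits  : logits after replacing one layer's activations with the
                  SAE's reconstruction, same shape (hypothetical setup)
    sae_latents : the SAE's overcomplete latent activations
    """
    p = softmax(orig_logits)   # original output distribution
    q = softmax(sae_logits)    # distribution with SAE spliced in
    # KL divergence, averaged over token positions
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()
    # L1 penalty encouraging few simultaneously active features
    sparsity = np.abs(sae_latents).sum(axis=-1).mean()
    return kl + sparsity_coeff * sparsity
```

Note that, unlike a standard SAE loss, no activation-reconstruction term appears: the SAE is judged only by how well the spliced model reproduces the original model's outputs.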
Compared to standard SAEs, e2e SAEs offer a Pareto improvement: they explain more of the network's performance, require fewer total features, and need fewer simultaneously active features per datapoint, all without sacrificing interpretability. The authors also explore geometric and qualitative differences between e2e SAE features and standard SAE features, showing that e2e SAEs capture the network's essential features more efficiently.
The paper includes experiments on language models (GPT2-small and Tinystories-1M) to validate the effectiveness of e2e SAEs. Key findings include:
1. E2e SAEs require fewer features per datapoint and fewer total features over the dataset compared to standard SAEs for the same level of performance explained.
2. E2e SAEs with additional downstream reconstruction loss (SAE$_{e2e+ds}$) achieve similar performance explained as SAE$_{e2e}$ while maintaining activations that follow similar pathways through later layers.
3. The improved efficiency of e2e SAEs does not come at the cost of interpretability, as measured by automated interpretability scores and qualitative analysis.
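The SAE$_{e2e+ds}$ variant in finding 2 can be sketched by extending the e2e objective with a penalty on how far the spliced model's later-layer activations drift from the original model's. Again a hedged numpy sketch under assumed shapes and coefficients, not the authors' code: the mean-squared-error form of the downstream term and the per-layer summation are illustrative choices.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def e2e_ds_loss(orig_logits, sae_logits, sae_latents,
                orig_downstream, sae_downstream,
                sparsity_coeff=1.0, ds_coeff=1.0):
    """Sketch of the SAE_e2e+ds loss: e2e KL + sparsity + downstream term.

    orig_downstream / sae_downstream : lists of activation arrays at layers
    after the SAE insertion point, from the original and spliced forward
    passes respectively (hypothetical interface).
    """
    p = softmax(orig_logits)
    q = softmax(sae_logits)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()
    sparsity = np.abs(sae_latents).sum(axis=-1).mean()
    # Penalize divergence of later-layer activations so the spliced model's
    # computation follows pathways similar to the original model's.
    ds = sum(((a - b) ** 2).mean()
             for a, b in zip(orig_downstream, sae_downstream))
    return kl + sparsity_coeff * sparsity + ds_coeff * ds
```

The extra term is what keeps SAE$_{e2e+ds}$ activations following pathways through later layers similar to those of the original model, at a similar level of performance explained.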
The authors also provide a library for training e2e SAEs and reproducing their analysis, along with detailed experimental metrics and results.