Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning


24 May 2024 | Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey
This paper introduces end-to-end (e2e) sparse dictionary learning as a method for training sparse autoencoders (SAEs) that identify functionally important features in neural networks. Traditional SAEs are trained to minimize the mean squared error (MSE) of their activation reconstructions, but low reconstruction error does not guarantee that the learned features are the ones the network actually uses to produce its output. E2e SAEs instead minimize the KL divergence between the output distribution of the original model and that of the model with the SAE's reconstructed activations spliced in, so that the learned features are, by construction, functionally important. This approach offers a Pareto improvement over standard SAEs: it explains more of the network's performance while requiring fewer total features and fewer active features per datapoint, without sacrificing interpretability.

The study compares standard SAEs (SAE_local), e2e SAEs (SAE_e2e), and e2e SAEs with an additional downstream-reconstruction term (SAE_e2e+ds). E2e SAEs require fewer features per datapoint and fewer total features. SAE_e2e+ds performs similarly to SAE_e2e in terms of features per datapoint, but its reconstruction errors at downstream layers are closer to those of SAE_local, indicating that it follows computational pathways closer to those of the original network.
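To make the difference between these objectives concrete, here is a minimal PyTorch-style sketch of the three training losses, written against a hypothetical interface rather than the authors' released code: `run_to_layer` and `run_from_layer` are assumed helpers for splitting the forward pass at the SAE's layer, and the sparsity coefficient and the exact form of the downstream-reconstruction term are illustrative choices.

```python
import torch
import torch.nn.functional as F


def sae_losses(model, sae, tokens, layer, sparsity_coeff=1.0):
    """Sketch of SAE_local, SAE_e2e, and SAE_e2e+ds losses. Only the SAE is trained;
    the base model is frozen, but gradients still flow through its downstream layers."""
    with torch.no_grad():
        acts = model.run_to_layer(tokens, layer)              # original activations at the SAE layer
        clean_logits, clean_downstream = model.run_from_layer(acts, layer)

    recon, feature_acts = sae(acts)                           # reconstruction and sparse codes
    sparsity = feature_acts.abs().sum(-1).mean()              # L1 penalty on feature activations

    # SAE_local: reconstruct the activations at the SAE's own layer (MSE).
    loss_local = F.mse_loss(recon, acts) + sparsity_coeff * sparsity

    # SAE_e2e: splice the reconstruction back in and match the original output distribution (KL).
    sae_logits, sae_downstream = model.run_from_layer(recon, layer)
    kl = F.kl_div(
        F.log_softmax(sae_logits, dim=-1),
        F.log_softmax(clean_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    loss_e2e = kl + sparsity_coeff * sparsity

    # SAE_e2e+ds: additionally reconstruct activations at downstream layers.
    downstream_mse = sum(
        F.mse_loss(d_sae, d_clean)
        for d_sae, d_clean in zip(sae_downstream, clean_downstream)
    ) / max(len(clean_downstream), 1)
    loss_e2e_ds = kl + downstream_mse + sparsity_coeff * sparsity

    return loss_local, loss_e2e, loss_e2e_ds
```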
In the paper's interpretability evaluations, the features learned by e2e SAEs are rated at least as interpretable as those of SAE_local. The paper also explores geometric and qualitative differences between the SAE types, finding that e2e SAE features are more orthogonal to one another and less prone to feature splitting. However, e2e SAEs can be less stable across random seeds than SAE_local and SAE_e2e+ds. The study further shows that e2e SAEs maintain their advantage even when compared to SAEs trained with more computational resources. Together, the results suggest that standard SAEs capture information about the structure of the dataset that is not maximally useful for explaining the network's performance; by directly optimizing for functional importance, e2e SAEs offer a more targeted way to identify the features that actually drive the network's behavior. The authors also provide a library for training e2e SAEs and reproducing the analysis, along with additional results on other layers and models.
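As a companion sketch, two of the quantities used above to compare SAE types can be computed as follows; the function names and the choice of nearest-neighbour cosine similarity as the orthogonality statistic are illustrative assumptions, not the paper's exact metrics.

```python
import torch
import torch.nn.functional as F


def l0_per_datapoint(feature_acts: torch.Tensor) -> float:
    """Average number of active (non-zero) SAE features per datapoint."""
    return (feature_acts != 0).float().sum(-1).mean().item()


def mean_nearest_neighbour_cosine(decoder_weight: torch.Tensor) -> float:
    """For each dictionary direction (a decoder column), cosine similarity to its nearest
    neighbour; lower values indicate a more orthogonal, less 'split' dictionary."""
    dirs = F.normalize(decoder_weight, dim=0)   # shape (d_model, n_features)
    sims = dirs.T @ dirs                        # pairwise cosine similarities
    sims.fill_diagonal_(-1.0)                   # exclude self-similarity
    return sims.max(dim=-1).values.mean().item()
```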