Scaling and evaluating sparse autoencoders

6 Jun 2024 | Leo Gao*, Tom Dupré la Tour†, Henk Tillman†, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu†
This paper presents a study on scaling and evaluating sparse autoencoders (SAEs) for extracting interpretable features from language models. The authors propose k-sparse autoencoders, which control sparsity directly through a TopK activation function, simplifying tuning and improving the reconstruction-sparsity trade-off. They find that these autoencoders have fewer dead latents, even at large scales, and demonstrate clean scaling laws with respect to autoencoder size and sparsity.

They introduce new metrics for evaluating feature quality based on feature recovery, activation explainability, and sparsity of downstream effects; these metrics generally improve with autoencoder size. The authors train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens, demonstrating the scalability of their approach, and they release code and autoencoders for open-source models along with a visualizer.

The study shows that larger autoencoders generally produce better features, though the effect of the sparsity level (L0) is more complex: increasing L0 can improve probe loss and ablation sparsity but worsens explainability. TopK autoencoders outperform ReLU autoencoders on the reconstruction-sparsity trade-off and are less affected by activation shrinkage.

The paper also explores the scaling laws of SAEs, finding that MSE follows a power law with compute and that training to convergence yields better reconstruction. The number of tokens needed to reach convergence increases with the number of latents, and including an irreducible loss term improves the quality of the fits.

On downstream evaluations, k-sparse autoencoders improve more on downstream loss than on MSE. SAEs can recover known features and provide interpretable explanations, though precision is challenging to evaluate. The authors conclude that SAEs have significant potential for improving the interpretability of language models, that further research is needed to identify the best metrics for judging feature relevance to downstream applications, and that combining SAEs with techniques such as Mixture-of-Experts (MoE) could improve the efficiency and scalability of autoencoder training.
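To make the TopK mechanism concrete, the sketch below shows a minimal k-sparse autoencoder in PyTorch: sparsity is enforced by keeping only the k largest pre-activations per example, so no L1 penalty or sparsity coefficient needs to be tuned. The class name, dimensions, and the shared pre-encoder bias are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class TopKSparseAutoencoder(nn.Module):
    """Minimal k-sparse autoencoder sketch: sparsity is set directly by k."""

    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model, bias=False)
        # Bias subtracted from the input before encoding and added back after decoding.
        self.pre_bias = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # Encode, then zero out everything except the k largest latents per example.
        pre_acts = self.encoder(x - self.pre_bias)
        topk_vals, topk_idx = pre_acts.topk(self.k, dim=-1)
        latents = torch.zeros_like(pre_acts).scatter_(-1, topk_idx, topk_vals)
        # Decode back to the activation dimension.
        recon = self.decoder(latents) + self.pre_bias
        return recon, latents


# The training objective is plain reconstruction MSE; k fixes the L0 sparsity.
sae = TopKSparseAutoencoder(d_model=768, n_latents=32768, k=32)
x = torch.randn(4, 768)  # stand-in for residual-stream activations
recon, latents = sae(x)
loss = (recon - x).pow(2).mean()
```

Because the activation keeps exact top-k values rather than shrinking them toward zero (as an L1 penalty would), this construction also illustrates why TopK autoencoders are less prone to activation shrinkage.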
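The scaling-law finding can likewise be illustrated with a small fit. The sketch below fits MSE as a power law in compute with an irreducible loss term, using SciPy on synthetic placeholder data; the constants, the reference scale C_REF, and the data are invented for illustration and do not come from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# MSE modeled as a power law in compute plus an irreducible (asymptotic) term,
# the term the summary notes improves the quality of the fits:
#   L(C) = L_irr + a * (C / C_ref)^(-alpha)
C_REF = 1e18  # arbitrary reference scale, keeps fit parameters well conditioned


def power_law(compute, L_irr, a, alpha):
    return L_irr + a * (compute / C_REF) ** (-alpha)


# Synthetic placeholder data; a real fit would use (compute, MSE) pairs
# measured from autoencoder training runs at different scales.
rng = np.random.default_rng(0)
compute = np.logspace(15, 21, 20)
mse = power_law(compute, 0.05, 0.25, 0.2) * rng.lognormal(0.0, 0.02, size=20)

params, _ = curve_fit(power_law, compute, mse, p0=[0.01, 0.3, 0.3])
L_irr_hat, a_hat, alpha_hat = params
print(f"irreducible loss ~ {L_irr_hat:.3f}, exponent ~ {alpha_hat:.3f}")
```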