Scaling and evaluating sparse autoencoders


6 Jun 2024 | Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu
The paper "Scaling and Evaluating Sparse Autoencoders" by Leo Gao explores the use of sparse autoencoders to extract interpretable features from language models. The authors address the challenges of training large-scale sparse autoencoders, particularly the balance between reconstruction quality and sparsity, and the presence of dead latents. They propose using k-sparse autoencoders to directly control sparsity, which simplifies tuning and improves the reconstruction-sparsity frontier. The study finds that modifications can reduce the number of dead latents even at large scales. The authors also introduce new metrics to evaluate feature quality, including downstream loss, probe loss, explainability, and sparsity of downstream effects. These metrics generally improve with larger autoencoders. To demonstrate the scalability of their approach, they train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. The paper includes detailed methods, scaling laws, and evaluations, and releases code and autoencoders for open-source models, along with a visualizer.The paper "Scaling and Evaluating Sparse Autoencoders" by Leo Gao explores the use of sparse autoencoders to extract interpretable features from language models. The authors address the challenges of training large-scale sparse autoencoders, particularly the balance between reconstruction quality and sparsity, and the presence of dead latents. They propose using k-sparse autoencoders to directly control sparsity, which simplifies tuning and improves the reconstruction-sparsity frontier. The study finds that modifications can reduce the number of dead latents even at large scales. The authors also introduce new metrics to evaluate feature quality, including downstream loss, probe loss, explainability, and sparsity of downstream effects. These metrics generally improve with larger autoencoders. To demonstrate the scalability of their approach, they train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. The paper includes detailed methods, scaling laws, and evaluations, and releases code and autoencoders for open-source models, along with a visualizer.