Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

2024-08-02 | Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár and Neel Nanda
JumpReLU SAEs are a modified version of standard sparse autoencoders (SAEs) that use a JumpReLU activation function in place of ReLU. The JumpReLU zeroes out pre-activations below a learnable positive threshold, improving sparsity while preserving reconstruction fidelity. Because the threshold and an L0 sparsity penalty are piecewise constant, the model is trained with straight-through estimators (STEs), which makes it possible to penalise L0 directly rather than relying on proxies such as L1 that cause shrinkage. JumpReLU SAEs remain efficient to train and run.

The paper evaluates JumpReLU, Gated, and TopK SAEs on Gemma 2 9B activations and finds that JumpReLU SAEs consistently provide more faithful reconstructions at a given sparsity level, achieving state-of-the-art reconstruction fidelity among these architectures. Manual and automated interpretability studies further show that JumpReLU features are similarly interpretable to those of Gated and TopK SAEs. The results highlight how effectively JumpReLU SAEs balance sparsity against reconstruction fidelity, and their potential for improving mechanistic interpretability of language models.
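To make the idea concrete, below is a minimal sketch of a JumpReLU activation with a straight-through estimator, written in PyTorch. It is illustrative only, not the authors' released code: the rectangle-kernel pseudo-derivative, the `bandwidth` value, and all names are assumptions chosen for readability.

```python
import torch

class JumpReLUSTE(torch.autograd.Function):
    """JumpReLU forward with a straight-through estimator (STE) backward.

    Forward:  jumprelu(z) = z * H(z - theta), where H is the Heaviside step
    and theta > 0 is a learnable per-feature threshold.
    Backward: the step has zero gradient w.r.t. theta almost everywhere, so
    we substitute a rectangle-kernel pseudo-derivative of width `bandwidth`
    (an assumption; the paper derives its STEs via kernel density estimation,
    and the rectangle kernel is one simple instance of that recipe).
    """

    @staticmethod
    def forward(ctx, z, theta, bandwidth):
        ctx.save_for_backward(z, theta)
        ctx.bandwidth = bandwidth
        return z * (z > theta).to(z.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        z, theta = ctx.saved_tensors
        eps = ctx.bandwidth
        # Gradient w.r.t. the pre-activation: pass through where the unit fired.
        grad_z = grad_out * (z > theta).to(z.dtype)
        # Pseudo-gradient w.r.t. theta: -(theta / eps) * K((z - theta) / eps),
        # with K(u) = 1 for |u| < 1/2 and 0 otherwise (rectangle kernel).
        kernel = (((z - theta) / eps).abs() < 0.5).to(z.dtype)
        grad_theta = -(theta / eps) * kernel * grad_out
        # Sum over broadcast (batch) dimensions so the shape matches theta.
        while grad_theta.dim() > theta.dim():
            grad_theta = grad_theta.sum(0)
        return grad_z, grad_theta, None


if __name__ == "__main__":
    torch.manual_seed(0)
    batch, d_sae = 4, 16
    pre_acts = torch.randn(batch, d_sae, requires_grad=True)
    # Parameterise the threshold via log_theta to keep it positive (an assumption).
    log_theta = torch.zeros(d_sae, requires_grad=True)
    # A wide bandwidth is used here so the demo produces visible gradients;
    # in practice the bandwidth is a small hyperparameter.
    acts = JumpReLUSTE.apply(pre_acts, log_theta.exp(), 0.5)
    # L0 per example: number of features still active after the JumpReLU.
    l0 = (acts > 0).float().sum(-1).mean()
    acts.sum().backward()
    print(acts.shape, l0.item(), log_theta.grad.norm().item())
```

Because the backward pass supplies a nonzero pseudo-gradient for the threshold, an L0-style sparsity penalty (a count of active features, itself relaxed with the same kernel trick) can be optimised directly instead of an L1 proxy.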
[slides and audio] Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders