Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders


2024-8-2 | Senthooran Rajamanoharan*, Tom Lieberum†, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár and Neel Nanda
This paper introduces JumpReLU Sparse Autoencoders (JumpReLU SAEs), a novel approach to improving the reconstruction fidelity of sparse autoencoders (SAEs) in language models (LMs). SAEs are used to identify causally relevant and interpretable linear features in LM activations, but balancing sparsity and reconstruction fidelity is challenging. JumpReLU SAEs replace the ReLU activation function with a JumpReLU function, which sets pre-activations below a positive threshold to zero, improving reconstruction fidelity while maintaining sparsity.

The authors demonstrate that JumpReLU SAEs achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations compared to other recent methods like Gated and TopK SAEs. They also show that JumpReLU SAEs maintain interpretability through manual and automated interpretability studies. The training of JumpReLU SAEs is facilitated by the use of straight-through estimators (STEs), which allow for efficient gradient estimation despite the discontinuity introduced by the JumpReLU function. The paper evaluates JumpReLU SAEs on various metrics, including sparsity-fidelity trade-offs and feature activation frequencies, and compares them to Gated and TopK SAEs. The results show that JumpReLU SAEs consistently provide better or comparable reconstruction fidelity and interpretability, making them a promising improvement over existing SAE training methodologies.
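To make the description of the JumpReLU activation concrete, the following is a minimal NumPy sketch of a JumpReLU SAE forward pass: pre-activations below a learned positive per-feature threshold are zeroed, and the remaining sparse features are decoded back into the activation space. All names, dimensions, and values here are illustrative assumptions for exposition, not the paper's implementation; in particular, the STE-based training of the thresholds is not shown.

```python
import numpy as np

def jumprelu(z, theta):
    """JumpReLU activation: keep a pre-activation only if it exceeds the
    positive per-feature threshold theta; values at or below it become zero."""
    return z * (z > theta)

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, theta):
    """Illustrative forward pass of a JumpReLU sparse autoencoder.

    x:      (d_model,) language-model activation vector
    W_enc:  (d_model, d_sae) encoder weights;  W_dec: (d_sae, d_model) decoder weights
    theta:  (d_sae,) learned positive thresholds (hypothetical values here)
    Returns the reconstruction x_hat and the sparse feature activations.
    """
    pre_acts = x @ W_enc + b_enc        # feature pre-activations
    feats = jumprelu(pre_acts, theta)   # sparse feature activations
    x_hat = feats @ W_dec + b_dec       # reconstruction of the input activation
    return x_hat, feats

# Toy example with made-up dimensions, purely to show the shapes involved.
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
x = rng.normal(size=d_model)
W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
b_dec = np.zeros(d_model)
theta = np.full(d_sae, 0.05)            # positive thresholds

x_hat, feats = sae_forward(x, W_enc, b_enc, W_dec, b_dec, theta)
print("L0 (number of active features):", int((feats > 0).sum()))
```

Because the indicator `(z > theta)` has zero gradient with respect to the threshold almost everywhere, training the thresholds requires a surrogate gradient; the paper's use of straight-through estimators addresses exactly this discontinuity.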