Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

20 May 2024 | Aleksandar Makelov, Georg Lange, Neel Nanda
This paper presents a framework for evaluating sparse autoencoders (SAEs) for interpretability and control in the context of specific tasks. The authors propose assessing feature dictionaries by comparing them against supervised feature dictionaries, and they apply this framework to the Indirect Object Identification (IOI) task using GPT-2 Small, with SAEs trained on either the IOI distribution or OpenWebText. They find that these SAEs capture interpretable features relevant to the IOI task, but are less successful than supervised features at controlling the model. They also observe two qualitative phenomena in SAE training: feature occlusion, where a causally relevant concept is overshadowed by higher-magnitude features, and feature over-splitting, where a single binary feature splits into many smaller, less interpretable features.

The paper also discusses the linear representation hypothesis and sparse autoencoders, and presents methods for evaluating feature dictionaries in the context of specific tasks. The authors find that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task, and that task-specific SAEs require fewer feature changes to edit attributes than full-distribution SAEs. They conclude that supervised feature dictionaries can be a valuable tool for automating aspects of this evaluation process, and hope that their framework provides a useful step towards more objective and grounded evaluations of sparse dictionary learning methods.
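To make the two operations the summary refers to more concrete, here is a minimal PyTorch sketch of a standard ReLU sparse autoencoder and of editing an attribute by swapping selected feature activations between a source prompt and a counterfactual prompt. This is not the authors' released code; the module layout, the `edit_attribute` helper, and the error-preserving edit are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of an SAE reconstruction and a
# feature-swap attribute edit. Dimensions and helper names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """A ReLU encoder into an overcomplete dictionary plus a linear decoder
    whose columns act as the learned feature directions."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def encode(self, acts: torch.Tensor) -> torch.Tensor:
        # Sparse, non-negative feature activations.
        return F.relu(self.encoder(acts))

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        # Reconstruct the original activations from the sparse code.
        return self.decoder(self.encode(acts))


def edit_attribute(sae: SparseAutoencoder,
                   acts_src: torch.Tensor,
                   acts_tgt: torch.Tensor,
                   feature_idx: torch.Tensor) -> torch.Tensor:
    """Swap the chosen features' activations from a counterfactual prompt
    (acts_tgt) into the source prompt's activations (acts_src)."""
    f_src = sae.encode(acts_src)
    f_tgt = sae.encode(acts_tgt)
    f_edited = f_src.clone()
    f_edited[..., feature_idx] = f_tgt[..., feature_idx]
    # Keep the part of the activation the SAE fails to reconstruct, so the
    # edit changes only the targeted features rather than the whole vector.
    error = acts_src - sae(acts_src)
    return sae.decoder(f_edited) + error
```

Under this sketch, "task-specific SAEs allow for fewer feature changes" corresponds to `feature_idx` needing fewer entries to flip an attribute when the SAE was trained on the task distribution rather than on the full pretraining distribution.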