Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

20 May 2024 | Aleksandar Makelov, Georg Lange, Neel Nanda
This paper presents a framework for evaluating sparse autoencoders (SAEs) for interpretability and control in the context of specific tasks. The authors propose assessing feature dictionaries by comparing them against supervised feature dictionaries, and they apply this framework to the Indirect Object Identification (IOI) task using GPT-2 Small, with SAEs trained on either the IOI distribution or OpenWebText. They find that these SAEs capture interpretable features relevant to the IOI task, but are less successful than supervised features at controlling the model. They also observe two qualitative phenomena in SAE training: feature occlusion, where a causally relevant concept is overshadowed by higher-magnitude features, and feature over-splitting, where a single binary feature splits into many smaller, less interpretable features.

The paper also discusses the linear representation hypothesis and sparse autoencoders, and presents methods for evaluating feature dictionaries in the context of specific tasks. The authors find that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task, and that task-specific SAEs require fewer feature changes to edit attributes than full-distribution SAEs. They conclude that supervised feature dictionaries can be a valuable tool for automating aspects of this evaluation process, and hope that their framework provides a useful step towards more objective and grounded evaluations of sparse dictionary learning methods.
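To make the two operations the summary refers to more concrete, here is a minimal PyTorch sketch of a standard ReLU sparse autoencoder and of editing an attribute by swapping selected feature activations between a source prompt and a counterfactual prompt. This is not the authors' released code; the module layout, the `edit_attribute` helper, and the error-preserving edit are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of an SAE reconstruction and a
# feature-swap attribute edit. Dimensions and helper names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """A ReLU encoder into an overcomplete dictionary plus a linear decoder
    whose columns act as the learned feature directions."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def encode(self, acts: torch.Tensor) -> torch.Tensor:
        # Sparse, non-negative feature activations.
        return F.relu(self.encoder(acts))

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        # Reconstruct the original activations from the sparse code.
        return self.decoder(self.encode(acts))


def edit_attribute(sae: SparseAutoencoder,
                   acts_src: torch.Tensor,
                   acts_tgt: torch.Tensor,
                   feature_idx: torch.Tensor) -> torch.Tensor:
    """Swap the chosen features' activations from a counterfactual prompt
    (acts_tgt) into the source prompt's activations (acts_src)."""
    f_src = sae.encode(acts_src)
    f_tgt = sae.encode(acts_tgt)
    f_edited = f_src.clone()
    f_edited[..., feature_idx] = f_tgt[..., feature_idx]
    # Keep the part of the activation the SAE fails to reconstruct, so the
    # edit changes only the targeted features rather than the whole vector.
    error = acts_src - sae(acts_src)
    return sae.decoder(f_edited) + error
```

Under this sketch, "task-specific SAEs allow for fewer feature changes" corresponds to `feature_idx` needing fewer entries to flip an attribute when the SAE was trained on the task distribution rather than on the full pretraining distribution.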