RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations

27 Feb 2024 | Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger
RAVEL (Resolving Attribute–Value Entanglements in Language Models) is a benchmark dataset designed to evaluate how effectively interpretability methods disentangle the roles of neurons in language models. The dataset covers five entity types (cities, people, verbs, physical objects, and occupations), each with at least 500 instances, and each instance has multiple attributes. The goal is to assess how well interpretability methods localize and disentangle these attributes while generalizing to new cases. The evaluation metric is based on interchange interventions: a candidate feature is patched with its value from a different input, and a method succeeds if this changes the targeted attribute in the model's output while leaving other attributes intact. The paper introduces Multi-task Distributed Alignment Search (MDAS), a new method that learns a feature space satisfying multiple causal criteria. MDAS achieves state-of-the-art results on RAVEL, demonstrating the importance of identifying representations that are distributed across neurons. The paper also discusses the limitations of existing interpretability methods and suggests future directions, emphasizing the need for methods that isolate individual concepts and disentangle representations across layers. The benchmark and methods are released on GitHub for further research and comparison.
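To make the interchange-intervention metric concrete, here is a minimal sketch in PyTorch. The toy model, the `interchange_intervention` helper, and the choice of layer and token position are illustrative assumptions rather than the paper's implementation; the same hook-based pattern applies to a real language model by hooking the residual stream or an MLP output.

```python
import torch
import torch.nn as nn

# Toy stand-in for a language model; any module exposing intermediate
# activations via forward hooks supports the same intervention pattern.
class TinyModel(nn.Module):
    def __init__(self, d=16, vocab=100):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.layer1 = nn.Linear(d, d)
        self.layer2 = nn.Linear(d, d)
        self.head = nn.Linear(d, vocab)

    def forward(self, tokens):
        h = self.embed(tokens)          # (batch, seq, d)
        h = torch.relu(self.layer1(h))
        h = torch.relu(self.layer2(h))
        return self.head(h)             # (batch, seq, vocab)

def interchange_intervention(model, base_tokens, source_tokens, module, position):
    """Swap `module`'s activation at `position` from the source run
    into the base run, then return the patched logits."""
    cached = {}

    # 1) Run on the source input and cache the activation of interest.
    def cache_hook(mod, inp, out):
        cached["act"] = out[:, position, :].detach().clone()

    handle = module.register_forward_hook(cache_hook)
    with torch.no_grad():
        model(source_tokens)
    handle.remove()

    # 2) Re-run on the base input, overwriting that activation with
    #    the cached source value (returning a tensor replaces the output).
    def patch_hook(mod, inp, out):
        out = out.clone()
        out[:, position, :] = cached["act"]
        return out

    handle = module.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(base_tokens)
    handle.remove()
    return patched_logits

model = TinyModel()
base = torch.randint(0, 100, (1, 5))    # e.g. a prompt about one entity
source = torch.randint(0, 100, (1, 5))  # e.g. a prompt about another entity
logits = interchange_intervention(model, base, source, model.layer1, position=2)
```

Comparing `logits` against the unpatched base-run logits shows whether the chosen activation carries the targeted attribute (e.g. a city's country) without disturbing other attributes of the same entity (e.g. its continent); RAVEL's metric aggregates exactly this kind of success/failure check across many entity pairs.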