27 Feb 2024 | Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger
RAVEL is a benchmark for evaluating interpretability methods on language models (LMs), focusing on their ability to disentangle entity attributes in a causal and generalizable manner. The dataset covers five entity types (cities, people's names, verbs, physical objects, and occupations), each with multiple attributes and prompt templates. The benchmark uses interchange interventions to assess how well interpretability methods can isolate the causal effect of an individual attribute. The authors introduce Multi-task Distributed Alignment Search (MDAS), a method that outperforms existing techniques on RAVEL, demonstrating the importance of identifying features that are distributed across activations rather than restricting analysis to individual neurons. Methods are evaluated on how well they disentangle attributes, using Cause and Iso scores. Results show that MDAS achieves the highest Disentangle scores (the mean of Cause and Iso), highlighting the effectiveness of multi-task learning for isolating attributes. The study also finds that some attribute pairs are more entangled than others, and that disentanglement improves across the layers of Llama2-7B. The work adds to the growing body of evidence that interpretability methods need to identify features distributed across neurons. The benchmark is released at https://github.com/explanare/ravel.
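To make the interchange-intervention setup concrete, here is a minimal sketch of the core operation: a rotation maps the model's hidden activation into a feature basis, a targeted block of features from a "source" run is swapped into the "base" run, and the result is rotated back. The rotation `Q`, the `feature_dims` slice, and the toy activations below are illustrative assumptions (in MDAS the rotation is learned and the activations come from real forward passes), not RAVEL's reference implementation.

```python
# Minimal sketch of an interchange intervention in a featurized subspace.
# Toy stand-ins: a random orthogonal Q replaces the learned DAS/MDAS rotation,
# and random vectors replace real hidden activations.
import numpy as np

rng = np.random.default_rng(0)
d = 16                       # hidden size of the intervened activation (toy value)
feature_dims = slice(0, 4)   # subspace claimed to encode one attribute, e.g. a city's country

# Random orthogonal matrix standing in for the learned rotation.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

def interchange(base_act: np.ndarray, source_act: np.ndarray) -> np.ndarray:
    """Swap the targeted feature dims of the source activation into the base activation."""
    base_f = base_act @ Q                      # move base activation to the feature basis
    source_f = source_act @ Q                  # move source activation to the feature basis
    base_f[feature_dims] = source_f[feature_dims]  # overwrite only the targeted features
    return base_f @ Q.T                        # rotate back to the model's activation space

base, source = rng.normal(size=d), rng.normal(size=d)
patched = interchange(base, source)

# Scoring idea: Cause measures how often the patched run now predicts the source
# entity's attribute value (e.g. Paris's country prompt answering "Japan" after
# patching in Tokyo's features); Iso measures how often unrelated attributes of
# the base entity are left unchanged; Disentangle is reported as their mean.
print(patched.shape)
```

A usage note: in practice the swap is applied to the residual-stream activation at the entity token during a forward pass (e.g. via a hook), and Cause/Iso are averaged over many base/source prompt pairs rather than computed from a single toy example as above.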