CausalGym: Benchmarking causal interpretability methods on linguistic tasks

19 Feb 2024 | Aryaman Arora, Dan Jurafsky, Christopher Potts
**CausalGym: Benchmarking Causal Interpretability Methods on Linguistic Tasks**

**Authors:** Aryaman Arora, Dan Jurafsky, Christopher Potts
**Institution:** Stanford University

**Abstract:** Language models (LMs) have become powerful tools in psycholinguistic research, but most prior work focuses on behavioral measures. Model interpretability research, on the other hand, has begun to reveal the abstract causal mechanisms shaping LM behavior. To bridge these two strands of research, we introduce CausalGym. We adapt and expand the SyntaxGym suite of tasks to benchmark the ability of interpretability methods to causally affect model behavior. Using the pythia models (14M–6.9B), we assess the causal efficacy of various interpretability methods, including linear probing and distributed alignment search (DAS). We find that DAS outperforms the other methods, and we use it to study the learning trajectories of two challenging linguistic phenomena in pythia-1b: negative polarity item licensing and filler-gap dependencies. Our analysis shows that the mechanisms implementing these tasks are learned in discrete stages, not gradually.

**Key Contributions:**
- **CausalGym:** A multi-task benchmark of linguistic behaviors for measuring the causal efficacy of interpretability methods.
- **DAS:** Outperforms the other methods in causal efficacy, demonstrating its effectiveness in controlling model behavior.
- **Case Studies:** Investigate how LMs learn negative polarity item licensing and filler-gap dependencies, revealing that these mechanisms emerge in discrete stages during training.

**Methods:**
- **CausalGym:** Adapts and expands the SyntaxGym suite to generate large numbers of span-aligned minimal pairs for linguistic tasks.
- **DAS:** Learns an intervention direction that maximizes the output probability of the counterfactual label.
- **Evaluation:** Measures causal efficacy using the log odds-ratio and selectivity, comparing different feature-finding methods.
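The DAS procedure can be pictured as an interchange intervention along a single learned direction: swap the base activation's component along that direction with the source (counterfactual) input's component, then train the direction so the model outputs the counterfactual label. The sketch below is purely illustrative, not the authors' implementation: `lm_head`, the toy activations `h_base`/`h_source`, and the label id are hypothetical stand-ins for a real LM's components.

```python
import torch

torch.manual_seed(0)
d_model, vocab = 16, 10

# Frozen stand-in for the model's output projection (a real LM head in practice).
lm_head = torch.nn.Linear(d_model, vocab)
for p in lm_head.parameters():
    p.requires_grad_(False)

# The only trained parameter in DAS: a direction in activation space.
direction = torch.nn.Parameter(torch.randn(d_model))

def interchange(h_base, h_source, u):
    """Replace the base activation's component along unit vector u with the
    source activation's component (a rank-1 interchange intervention)."""
    u = u / u.norm()
    return h_base - (h_base @ u) * u + (h_source @ u) * u

# Hypothetical activations for a base input and a counterfactual source input.
h_base = torch.randn(d_model)
h_source = torch.randn(d_model)
counterfactual = torch.tensor([3])  # hypothetical counterfactual label id

def cf_loss():
    logits = lm_head(interchange(h_base, h_source, direction)).unsqueeze(0)
    return torch.nn.functional.cross_entropy(logits, counterfactual)

opt = torch.optim.Adam([direction], lr=0.05)
initial_loss = cf_loss().item()
for _ in range(200):
    opt.zero_grad()
    loss = cf_loss()  # DAS objective: maximize p(counterfactual label)
    loss.backward()
    opt.step()
final_loss = cf_loss().item()
```

Training moves only `direction`; the (stand-in) model stays frozen, so any gain in counterfactual probability is attributable to the learned subspace.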
**Results:**
- **Overall Odds-Ratio:** DAS consistently finds the most causally efficacious features, followed by probing and difference-in-means.
- **Selectivity:** DAS is not more selective than probing or difference-in-means, suggesting that its advantage may come from its access to model outputs during training.

**Discussion:**
- **Learning Dynamics:** Both the NPI-licensing and filler-gap dependency-tracking mechanisms emerge in discrete stages during training.
- **Interpretability and Safety:** While interpretability methods can help us understand LM behavior, they do not guarantee safe deployment in high-risk settings.

**Limitations:**
- **Task Diversity:** Future work should explore interpretability methods on a broader range of linguistic and non-linguistic behaviors.
- **Language Data:** The benchmark currently includes only English data; other languages may yield different results.
- **Model Variability:** Results may differ across models trained on the same data.

**Ethics:**
- Emphasizes the importance of thinking critically about the implications of interpretability research.
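As a rough illustration of the log odds-ratio metric mentioned under Evaluation: it scores how far an intervention shifts probability mass from the base-consistent label toward the counterfactual label. The function and example probabilities below are hypothetical, and the paper's exact formulation (e.g. how scores are aggregated across evaluation pairs) may differ.

```python
import math

def log_odds_ratio(p_base, p_intervened, label_base, label_cf):
    """Compare the odds of the counterfactual label vs. the base label
    before and after an intervention. `p_base` and `p_intervened` map
    labels to probabilities. Positive values mean the intervention pushed
    the model toward the counterfactual behavior; 0 means no effect."""
    odds_before = p_base[label_cf] / p_base[label_base]
    odds_after = p_intervened[label_cf] / p_intervened[label_base]
    return math.log(odds_after / odds_before)

# Hypothetical subject-verb agreement example: the intervention raises the
# counterfactual verb form ("are") relative to the base form ("is").
before = {"is": 0.90, "are": 0.10}
after = {"is": 0.40, "are": 0.60}
score = log_odds_ratio(before, after, label_base="is", label_cf="are")
```

A causally efficacious feature yields large positive scores on the task it targets, while a *selective* one additionally leaves the odds near zero change on unrelated tasks.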