CausalGym: Benchmarking causal interpretability methods on linguistic tasks

19 Feb 2024 | Aryaman Arora, Dan Jurafsky, Christopher Potts
CausalGym is a benchmark for evaluating the causal efficacy of interpretability methods on linguistic tasks. It adapts tasks from SyntaxGym to assess how well interpretability methods can identify features that causally influence model behavior. The study focuses on the Pythia models (14M–6.9B parameters) and evaluates a range of interpretability methods, including linear probing and distributed alignment search (DAS).

The paper frames its evaluation as intervention-based interpretability: interventions are used to test whether particular neural network components causally implement a behavior. The framework is grounded in the do-operator from causal inference, and the strength of an intervention is measured with the log odds-ratio between the model's pre- and post-intervention output distributions.

The benchmark comprises 29 tasks: one novel task (agr_gender) and 28 adapted from SyntaxGym. Across these tasks, DAS consistently finds the most causally efficacious features, followed by probing and difference-in-means. The unsupervised methods, PCA and k-means, perform considerably worse, and despite being supervised, LDA barely outperforms random features.

Because DAS performs best, it is used to study how LMs learn two linguistic phenomena over the course of training: negative polarity item (NPI) licensing and filler-gap dependencies (wh-extraction from prepositional phrases). The analysis shows that these mechanisms are learned in discrete stages rather than gradually: the input feature crosses over several different positions before arriving at the output position. For both tasks, the model initially learns to move information directly from the alternating token to the output position; later in training, intermediate steps are added in the middle layers. DAS finds a greater causal effect across the board, but DAS and probing largely agree on which regions are the most causally efficacious at each layer.
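The two core operations of this framework can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the function names, the single-direction patch, and the toy numbers are all assumptions made for clarity. The first function performs an interchange intervention along one learned feature direction (the kind of direction probing or DAS produces); the second computes the log odds-ratio that quantifies how strongly the intervention shifted the model's preference between the two candidate outputs.

```python
import math

def patch_along_direction(h_base, h_src, v):
    """Interchange intervention along one feature direction (illustrative):
    replace the component of the base representation h_base lying along
    direction v with the corresponding component of the source
    representation h_src, leaving all orthogonal components untouched."""
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]  # normalize to a unit vector
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    shift = dot(h_src, v) - dot(h_base, v)
    return [h + shift * x for h, x in zip(h_base, v)]

def log_odds_ratio(p_base, p_patched, y_base, y_src):
    """Strength of an intervention (illustrative): the change, in log odds,
    of the model's preference for the source-consistent token y_src over
    the base-consistent token y_base, before vs. after patching.
    p_base and p_patched map candidate output tokens to probabilities."""
    odds_before = p_base[y_src] / p_base[y_base]
    odds_after = p_patched[y_src] / p_patched[y_base]
    return math.log(odds_after / odds_before)

# Toy usage: the base input favors "is"; after patching in the source
# representation, the model flips toward "are", giving a positive score.
effect = log_odds_ratio(
    {"is": 0.9, "are": 0.1},   # output distribution before intervention
    {"is": 0.3, "are": 0.7},   # output distribution after intervention
    y_base="is", y_src="are",
)
```

A positive log odds-ratio means the patched feature causally pushed the model toward the counterfactual (source-consistent) output; zero means the intervention had no effect on the preference.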
The paper concludes that CausalGym provides a reliable benchmark for measuring the causal efficacy of interpretability methods. It encourages computational psycholinguists to move beyond studying the input-output behavior of LMs and to investigate how LMs learn linguistic behaviors. The authors also highlight the importance of causal evaluation for understanding neural networks and the need for further research on a greater variety of tasks.