Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms

2024 | Michael Hanna, Sandro Pezzelle, Yonatan Belinkov
This paper introduces EAP-IG, a method for finding circuits in language models (LMs) that extends edge attribution patching (EAP) with integrated gradients. The circuits framework aims to identify a minimal computational subgraph that explains an LM's behavior on a given task. A circuit is faithful if ablating all edges outside it does not change the model's behavior on the task. Faithfulness matters because it ensures that the simplified model accurately reflects the full model's behavior.

The paper compares EAP-IG against EAP and activation patching on six tasks: Indirect Object Identification (IOI), Gender-Bias, Greater-Than, Capital-Country, Subject-Verb Agreement (SVA), and Hypernymy. EAP circuits have high node overlap with manually found circuits, but EAP-IG circuits are more faithful than EAP circuits on many tasks. Activation patching circuits sometimes outperform both.

The study also shows that overlap between circuits is not a reliable indicator of faithfulness, especially in cross-task scenarios: circuits with high overlap may still be unfaithful. The authors therefore argue that faithfulness should be measured separately from overlap. They conclude that faithfulness is a critical quality for circuit finding, that EAP-IG is a more faithful alternative to EAP, and that further research should target circuit-finding methods that prioritize faithfulness and completeness. An implementation of the method is made available for further research.
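To make the contrast between a single-gradient attribution and an integrated-gradients attribution concrete, here is a minimal toy sketch. It is not the authors' implementation: the two-dimensional `loss`, the helper names `eap_scores` and `eap_ig_scores`, and the step count are all assumptions chosen for illustration, and the scoring form (activation difference times gradient) is the usual first-order approximation rather than something stated in this summary. The toy loss saturates at the clean input, so the single gradient is close to zero there, while averaging gradients along the path from the corrupted to the clean input recovers most of the true change.

```python
import math

def loss(z):
    # Hypothetical stand-in for a task metric: saturating in its two inputs.
    return math.tanh(3.0 * z[0] + 0.5 * z[1])

def grad(z):
    # Analytic gradient of the toy loss above.
    s = 1.0 - math.tanh(3.0 * z[0] + 0.5 * z[1]) ** 2
    return [3.0 * s, 0.5 * s]

def eap_scores(clean, corrupt):
    # Single-gradient (EAP-style) score: (corrupt - clean) * gradient at the clean input.
    g = grad(clean)
    return [(c - x) * gi for x, c, gi in zip(clean, corrupt, g)]

def eap_ig_scores(clean, corrupt, steps=50):
    # Integrated-gradients (EAP-IG-style) score: average the gradient at `steps`
    # points on the straight line from corrupt to clean, then multiply by the difference.
    total = [0.0] * len(clean)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [c + alpha * (x - c) for x, c in zip(clean, corrupt)]
        total = [t + gi for t, gi in zip(total, grad(point))]
    return [(c - x) * t / steps for x, c, t in zip(clean, corrupt, total)]

clean = [2.0, 1.0]     # toy "clean" activations (loss is saturated here)
corrupt = [0.0, 0.0]   # toy "corrupted" activations

eap = eap_scores(clean, corrupt)
ig = eap_ig_scores(clean, corrupt)
delta = loss(corrupt) - loss(clean)  # true effect of swapping in the corrupted input

print("EAP-style scores:   ", eap, "sum =", sum(eap))
print("EAP-IG-style scores:", ig, "sum =", sum(ig))
print("true change in loss:", delta)
```

In this toy setting the integrated scores sum to roughly the true change in the loss, whereas the single-gradient scores nearly vanish. That is one plausible intuition for why attributions based on integrated gradients could yield more faithful circuits, though the toy says nothing about real LMs.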