15 Jul 2024 | Michael Hanna, Sandro Pezzelle, Yonatan Belinkov
The paper "Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms" by Michael Hanna, Sandro Pezzelle, and Yonatan Belinkov introduces Edge Attribution Patching with Integrated Gradients (EAP-IG), a new method for finding circuits in language models (LMs). Most circuit-finding studies decide which edges belong in a circuit by performing a causal intervention on each edge independently, an approach that scales poorly with model size. Edge Attribution Patching (EAP), a gradient-based approximation to these interventions, is a scalable but imperfect alternative. EAP-IG refines EAP by incorporating integrated gradients, which consider the gradient at intermediate points between clean and corrupted activations rather than at a single point. The method is designed to find more faithful circuits: a circuit is faithful if all edges outside it can be ablated without changing the model's behavior on a given task. The paper demonstrates that circuits found with EAP-IG are more faithful than those found with EAP, even though both have high node overlap with manually found circuits. The authors also investigate the relationship between overlap and faithfulness, finding that while overlap is strongly correlated with faithfulness, it is not a good predictor of cross-task faithfulness when overlaps are moderate. The paper concludes by discussing the importance of faithfulness in circuit-finding and recommends best practices for circuit analysis.
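
To make the difference between the two scores concrete, here is a minimal sketch on a toy problem, not the authors' implementation. It assumes a hypothetical task metric that is a saturating function of a single activation vector, and it interpolates that vector directly between its corrupted and clean values. The names `metric`, `metric_grad`, `eap_score`, and `eap_ig_score`, the weights `w`, and the number of steps are all illustrative assumptions. The EAP-style score uses the gradient at the clean activation only, while the EAP-IG-style score averages the gradient over intermediate points.

```python
import math

def metric(z, w):
    """Hypothetical task metric: a saturating function of one activation vector z."""
    return math.tanh(sum(wi * zi for wi, zi in zip(w, z)))

def metric_grad(z, w):
    """Analytic gradient of the toy metric with respect to z."""
    s = sum(wi * zi for wi, zi in zip(w, z))
    scale = 1.0 - math.tanh(s) ** 2
    return [scale * wi for wi in w]

def eap_score(z_clean, z_corrupt, w):
    """First-order estimate of the metric change caused by patching in z_corrupt,
    using the gradient at the clean activation only."""
    grad = metric_grad(z_clean, w)
    return sum((zc - z) * g for zc, z, g in zip(z_corrupt, z_clean, grad))

def eap_ig_score(z_clean, z_corrupt, w, steps=50):
    """Same estimate, but with the gradient averaged over `steps` points on the
    straight line from the corrupted activation to the clean one."""
    avg_grad = [0.0] * len(w)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [zc + alpha * (z - zc) for z, zc in zip(z_clean, z_corrupt)]
        grad = metric_grad(point, w)
        avg_grad = [a + g / steps for a, g in zip(avg_grad, grad)]
    return sum((zc - z) * a for zc, z, a in zip(z_corrupt, z_clean, avg_grad))

# Illustrative values: the metric is saturated at both endpoints,
# so the gradient at the clean activation is close to zero.
w = [1.0, -0.5]
z_clean = [4.0, 0.0]
z_corrupt = [-4.0, 0.0]

true_change = metric(z_corrupt, w) - metric(z_clean, w)
eap = eap_score(z_clean, z_corrupt, w)
eap_ig = eap_ig_score(z_clean, z_corrupt, w)

print(f"true change:   {true_change:.4f}")
print(f"EAP estimate:  {eap:.4f}")
print(f"EAP-IG sketch: {eap_ig:.4f}")
```

In this toy setting the single-point gradient badly underestimates the effect of the patch because the metric is flat near the clean activation, whereas the averaged gradient tracks the true change closely. This only illustrates why intermediate points can help. In a real LM the scores are computed per edge with automatic differentiation, and the details of what is interpolated follow the paper rather than this sketch.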