13 Jun 2017 | Mukund Sundararajan, Ankur Taly, Qiqi Yan
Axiomatic Attribution for Deep Networks introduces Integrated Gradients, a method for attributing a deep network's prediction to its input features. The authors identify two fundamental axioms that attribution methods should satisfy: Sensitivity and Implementation Invariance. They show that most existing methods violate at least one of these axioms, which undermines the reliability and interpretability of the resulting attributions.

Integrated Gradients satisfies both axioms, requires no modification to the original network, and is simple to implement with a few calls to the gradient operator. Because it is invariant to implementation details, it produces consistent attributions across functionally equivalent networks. The paper also establishes a uniqueness result: among attribution methods, Integrated Gradients is singled out by additionally preserving symmetry between interchangeable features.

The authors apply the method to image, text, and chemistry models, using it to debug networks, extract rules, and improve user understanding of model predictions. Across tasks including object recognition, diabetic retinopathy prediction, question classification, neural machine translation, and chemistry models, Integrated Gradients reflects the distinctive features of inputs better than competing methods and helps identify degenerate features and anomalies in network behavior. The paper concludes that Integrated Gradients is a theoretically sound and practical approach for attributing deep network predictions.
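Concretely, the method integrates the gradients of the network output F along the straight-line path from a baseline input x' (e.g., a black image or an all-zero embedding) to the actual input x. In the paper's notation, the attribution to the i-th feature is

    IntegratedGrads_i(x) ::= (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i} \, d\alpha

and in practice the integral is approximated by a Riemann sum over m interpolation steps, so the whole computation reduces to m gradient calls. The sketch below illustrates this approximation in Python/NumPy under stated assumptions: `grad_fn` stands in for any autodiff call that returns the gradient of F at a point, and the step count of 50 is an illustrative choice, not a value fixed by the paper.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline=None, steps=50):
    """Riemann-sum approximation of Integrated Gradients.

    grad_fn  -- callable mapping an input array to the gradient of the
                model output F with respect to that input (assumed here;
                any autodiff framework can supply it).
    x        -- input to attribute, as a NumPy array.
    baseline -- reference input x'; defaults to an all-zero input.
    steps    -- number of interpolation steps m in the Riemann sum.
    """
    if baseline is None:
        baseline = np.zeros_like(x)
    # Evaluate gradients at points along the straight-line path
    # from the baseline to the input.
    grads = [grad_fn(baseline + (k / steps) * (x - baseline))
             for k in range(1, steps + 1)]
    # Average the gradients and scale by the input difference.
    return (x - baseline) * np.mean(grads, axis=0)
```

A useful sanity check that follows from the method's completeness property: the attributions should sum approximately to F(x) - F(x'), and the gap between the two can guide the choice of the number of steps.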