Towards Robust Interpretability with Self-Explaining Neural Networks


3 Dec 2018 | David Alvarez-Melis, Tommi S. Jaakkola
This paper introduces self-explaining neural networks (SENNs), models designed to produce interpretable explanations as part of the prediction process itself rather than post hoc. The authors propose three desiderata for explanations: explicitness, faithfulness, and stability. Existing interpretability methods often fail to meet these criteria, yielding unstable and sometimes contradictory explanations.

To address this, the authors progressively generalize linear classifiers to more complex models with architecturally explicit, interpretable structure. A self-explaining model combines three components: a concept encoder that maps the input to a small set of interpretable basis concepts, an input-dependent parametrizer that assigns a relevance score to each concept, and an aggregation function that combines concepts and relevances into a prediction. Faithfulness and stability are enforced through tailored regularization during training, so that the relevance scores behave like stable, locally linear coefficients. The paper also discusses how the basis concepts themselves can be learned, emphasizing fidelity, diversity, and grounding.
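As a rough illustration of this structure, here is a minimal PyTorch sketch, not the authors' implementation: the names (SelfExplainingNet, concept_encoder, parametrizer, robustness_penalty), the layer sizes, and the scalar-output simplification are assumptions made for brevity. The forward pass computes f(x) as a sum of concepts h(x) weighted by relevances theta(x), and the penalty term encourages theta(x) to act like the local derivative of the prediction with respect to the concepts, which is one way to read the faithfulness/stability regularization described above.

```python
import torch
import torch.nn as nn

class SelfExplainingNet(nn.Module):
    """Scalar-output self-explaining model: f(x) = sum_j theta(x)_j * h(x)_j."""
    def __init__(self, in_dim: int, n_concepts: int, hidden: int = 64):
        super().__init__()
        # Concept encoder h(x): maps the raw input to k interpretable basis concepts.
        self.concept_encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_concepts)
        )
        # Input-dependent parametrizer theta(x): one relevance score per concept.
        self.parametrizer = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_concepts)
        )

    def forward(self, x):
        h = self.concept_encoder(x)      # (B, k) concept activations
        theta = self.parametrizer(x)     # (B, k) concept relevances
        f = (theta * h).sum(dim=1)       # (B,)  aggregation g = weighted sum
        return f, h, theta


def robustness_penalty(model, x):
    """Penalize || grad_x f(x) - theta(x)^T J_h(x) ||_2 so that theta(x) behaves
    locally like the derivative of the prediction with respect to the concepts."""
    x = x.detach().clone().requires_grad_(True)
    f, h, theta = model(x)
    grad_f = torch.autograd.grad(f.sum(), x, create_graph=True)[0]   # (B, d)
    # Jacobian of h w.r.t. x, one concept at a time (k is kept small by design).
    rows = [torch.autograd.grad(h[:, j].sum(), x, create_graph=True)[0]
            for j in range(h.shape[1])]
    J_h = torch.stack(rows, dim=1)                                   # (B, k, d)
    proxy = torch.einsum("bk,bkd->bd", theta, J_h)                   # theta(x)^T J_h(x)
    return (grad_f - proxy).norm(dim=1).mean()
```

A full training objective would combine a standard classification loss on f with a small multiple of robustness_penalty, plus whatever reconstruction or sparsity terms are used to learn the concepts; those weights are illustrative knobs here, not values from the paper.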
The proposed method is evaluated on several benchmark datasets, including MNIST, UCI datasets, COMPAS, and CIFAR10, where it outperforms existing interpretability methods in terms of explicitness, faithfulness, and stability. The authors also discuss the difficulty of evaluating interpretability and stress the importance of robustness: an explanation should not change drastically between nearby inputs, which the paper quantifies with a local Lipschitz-style estimate of the explanation map (a sketch of such a check follows below). The paper concludes that self-explaining models offer a promising avenue for reconciling model complexity with interpretability in complex machine learning models.
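To make the stability criterion concrete, the following sketch estimates how much an explanation can change within a small neighborhood of an input, relative to how much the input changes. It uses simple Monte-Carlo sampling in input space; the paper searches for worst-case perturbations more directly, so the function name, sampling scheme, and default parameters below are assumptions for illustration, not the authors' procedure.

```python
import torch

def local_stability_estimate(explain_fn, x, eps=0.1, n_samples=100):
    """Monte-Carlo estimate of max_{||x'-x|| <= eps} ||e(x) - e(x')|| / ||x - x'||.

    explain_fn maps a batch of inputs to explanation vectors (e.g. theta(x) for a
    self-explaining model, or attribution scores for a post-hoc method).
    """
    e_x = explain_fn(x)                                   # (B, k) explanations at x
    worst = torch.zeros(x.shape[0], device=x.device)
    for _ in range(n_samples):
        delta = (torch.rand_like(x) * 2 - 1) * eps        # perturbation in [-eps, eps]^d
        e_xp = explain_fn(x + delta)                      # explanations at the perturbed point
        num = (e_x - e_xp).norm(dim=1)
        den = delta.reshape(x.shape[0], -1).norm(dim=1).clamp_min(1e-12)
        worst = torch.maximum(worst, num / den)
    return worst                                          # larger = less stable explanation
```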