The paper "Interpretable Explanations of Black Boxes by Meaningful Perturbation" by Ruth C. Fong and Andrea Vedaldi addresses the critical need for explaining the predictions of machine learning algorithms, particularly in high-impact and high-risk applications. The authors propose a general framework for learning explanations as meta-predictors, which can be applied to any black box algorithm. They specialize this framework to identify the parts of an image most responsible for a classifier's decision, making it model-agnostic and testable through explicit and interpretable image perturbations.
The paper revisits the concept of "explanation" at a formal level, aiming to develop principles and methods to explain any black box function. The authors argue that explanations should be interpretable rules that describe the input-output relationship captured by the function. They propose a framework where explanations are learned as meta-predictors, measured by their faithfulness to the classifier's predictions.
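To make this framing concrete, one way to write the meta-predictor idea as an objective is sketched below. The notation here is illustrative rather than the paper's exact equations: $f$ denotes the black box, $\mathcal{Q}$ a family of interpretable rules, $\ell$ a loss measuring how faithfully a rule reproduces the black box's predictions, and $\mathcal{R}$ a complexity penalty that keeps the learned explanation interpretable.

$$
Q^{*} \;=\; \operatorname*{arg\,min}_{Q \in \mathcal{Q}} \; \mathbb{E}_{x \sim p(x)}\big[\,\ell\big(f(x),\, Q(x)\big)\,\big] \;+\; \lambda\, \mathcal{R}(Q)
$$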
The paper also introduces a novel image saliency paradigm that learns where an algorithm looks by discovering which parts of an image most affect its output score when perturbed. Unlike many existing saliency techniques, this method explicitly edits the image, making it interpretable and testable. The authors demonstrate the effectiveness of their method through various experiments, including interpretability, deletion region representativeness, minimality of deletions, and adversarial defense.
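The mask-learning idea behind this paradigm can be sketched in code. The snippet below is a minimal, illustrative PyTorch implementation in the spirit of the paper's perturbation-mask optimization: it learns a low-resolution mask whose "deleted" (here, blurred) regions most reduce the classifier's score for a target class, while L1 and total-variation terms keep the deletion small and smooth. The helper names (`explain`, `blur`), the sigmoid mask parameterization, and all hyperparameter values are assumptions made for the sketch, not the authors' exact settings.

```python
# Minimal sketch of perturbation-mask saliency, in the spirit of Fong & Vedaldi.
# Hyperparameters, helper names, and the sigmoid parameterization are illustrative.
import torch
import torch.nn.functional as F
import torchvision


def blur(x, kernel_size=11, sigma=5.0):
    """Gaussian-blur an image batch; serves as the 'deleted' reference image."""
    return torchvision.transforms.functional.gaussian_blur(x, kernel_size, [sigma, sigma])


def explain(model, x, target_class, steps=300, lr=0.1,
            lambda_l1=1e-2, lambda_tv=1e-1, mask_size=28):
    """Learn a mask in [0,1] whose blurred-out regions most suppress target_class."""
    model.eval()
    x_blur = blur(x)                                    # perturbed reference image
    m = torch.full((1, 1, mask_size, mask_size), 0.0, requires_grad=True)
    opt = torch.optim.Adam([m], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        mask = torch.sigmoid(m)                         # keep mask values in [0, 1]
        mask_up = F.interpolate(mask, size=x.shape[-2:], mode="bilinear",
                                align_corners=False)
        x_pert = mask_up * x + (1 - mask_up) * x_blur   # mask=1 keeps, mask=0 deletes
        score = F.softmax(model(x_pert), dim=1)[0, target_class]

        # Drive the target score down while keeping deletions small and smooth.
        l1 = (1 - mask).mean()
        tv = ((mask[..., 1:, :] - mask[..., :-1, :]).abs().mean()
              + (mask[..., :, 1:] - mask[..., :, :-1]).abs().mean())
        loss = score + lambda_l1 * l1 + lambda_tv * tv
        loss.backward()
        opt.step()

    return torch.sigmoid(m).detach()


# Example usage (assumes an ImageNet-pretrained classifier and a preprocessed image x):
# model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
# saliency_mask = explain(model, x, target_class=243)
```

Regions where the learned mask is close to zero are the parts of the image whose removal most damages the prediction, which is exactly the kind of explicit, testable evidence the paragraph above describes.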
The paper concludes with a comprehensive framework for learning explanations and a new image saliency paradigm that provides interpretable and testable explanations for black box algorithms.