7 Jun 2018 | Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, Rory Sayres
The paper introduces Concept Activation Vectors (CAVs) and Testing with CAVs (TCAV) as a method for interpreting deep learning models in terms of human-friendly concepts. A CAV is learned by training a linear classifier to separate the activations produced by a concept's examples from those produced by random counterexamples, and then taking the vector orthogonal to the resulting decision boundary. TCAV uses directional derivatives to quantify how sensitive a model's prediction is to moving an input's activations in the direction of a concept. The method is applied to image classification and to a medical task, surfacing insights and biases in widely used neural network models and helping interpret a model that predicts diabetic retinopathy. The paper also includes statistical significance testing to guard against spurious CAVs, and compares TCAV with saliency maps, arguing that TCAV yields more reliable, human-interpretable concept-level explanations.
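To make the CAV construction concrete, here is a minimal sketch in Python, assuming you have already extracted layer activations for the concept images and for random counterexamples. The arrays `concept_acts` and `random_acts` and the function `learn_cav` are hypothetical names for illustration, not part of the paper's released code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_cav(concept_acts: np.ndarray, random_acts: np.ndarray) -> np.ndarray:
    """Train a linear classifier to separate concept activations from
    random-counterexample activations, and return the unit vector normal
    to its decision boundary -- the Concept Activation Vector (CAV).

    Both inputs are assumed to have shape (n_examples, n_units), i.e.
    flattened activations from one chosen layer of the network."""
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)),
                        np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_[0]                   # normal to the separating hyperplane
    return cav / np.linalg.norm(cav)     # unit-length concept direction
```

The choice of logistic regression here is one reasonable linear classifier; any linear separator whose decision boundary normal can be read off would serve the same purpose.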
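The TCAV score itself is the fraction of a class's inputs whose prediction has a positive directional derivative along the CAV. The sketch below assumes a hypothetical helper `grad_logit_wrt_act(x)` that returns the gradient of the target class logit with respect to the chosen layer's activation for input `x` (e.g. computed with your framework's autograd); it is not an API from the paper.

```python
import numpy as np

def tcav_score(inputs, cav: np.ndarray, grad_logit_wrt_act) -> float:
    """Fraction of the class's inputs whose prediction increases when the
    layer activation is nudged in the direction of the concept (the CAV).

    For each input, the directional derivative is the dot product of the
    gradient of the class logit (w.r.t. the layer activation) with the CAV."""
    directional_derivs = [grad_logit_wrt_act(x) @ cav for x in inputs]
    return float(np.mean([d > 0.0 for d in directional_derivs]))
```

In practice this score would be recomputed for CAVs trained against many different random counterexample sets, with a statistical test used to check that the scores differ significantly from those of random directions, as the paper describes.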