7 Jun 2018 | Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, Rory Sayres
This paper introduces Concept Activation Vectors (CAVs) and Testing with CAVs (TCAV) as a method for interpreting deep learning models. A CAV represents a high-level, human-defined concept in a network's activation space: it is learned by training a linear classifier to separate the layer activations of concept examples from those of random counterexamples, and taking the vector orthogonal to the resulting decision boundary. TCAV then uses directional derivatives along the CAV to measure how sensitive the model's prediction for a class is to that concept, yielding a quantitative measure of conceptual sensitivity. The method is designed to be accessible, customizable, and plug-in ready: users can define their own concepts with example data and interpret a trained model's behavior without retraining or modifying it.
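To make the mechanics concrete, here is a minimal sketch of how a CAV and a TCAV score could be computed with NumPy and scikit-learn. It assumes the layer activations and class-logit gradients have already been extracted from the model; the function names and array layouts are illustrative, not the paper's released API.

```python
# Minimal sketch of CAV learning and TCAV scoring, assuming activations
# and gradients are precomputed as NumPy arrays. Function names and
# array layouts here are illustrative, not the released TCAV API.
import numpy as np
from sklearn.linear_model import LogisticRegression


def learn_cav(concept_acts: np.ndarray, random_acts: np.ndarray) -> np.ndarray:
    """Learn a CAV at one layer.

    concept_acts: (n_concept, d) activations of concept example inputs.
    random_acts:  (n_random, d) activations of random counterexamples.
    Returns the unit vector orthogonal to the linear decision boundary
    separating the two sets, i.e. the concept activation vector.
    """
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    w = clf.coef_.ravel()
    return w / np.linalg.norm(w)


def tcav_score(class_grads: np.ndarray, cav: np.ndarray) -> float:
    """TCAV score for one (class, concept, layer) triple.

    class_grads: (n_examples, d) gradients of the class logit with
    respect to the layer activations, one row per class example.
    The directional derivative for each example is the dot product of
    its gradient with the CAV; the TCAV score is the fraction of
    examples for which that derivative is positive.
    """
    directional_derivs = class_grads @ cav
    return float(np.mean(directional_derivs > 0))
```

For the paper's running example, something like `learn_cav(striped_acts, random_acts)` followed by `tcav_score(zebra_grads, cav)` would estimate what fraction of zebra images have their zebra logit pushed upward by the "striped" concept (the variable names here are hypothetical).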
The paper demonstrates the effectiveness of TCAV through experiments on image classification and a medical application. It shows that TCAV can reveal biases in widely used neural networks and provide insight into how models make decisions. For example, TCAV shows that models are sensitive to gender- and race-related concepts for certain classes, even though those attributes were never provided as training labels. It also surfaces which diagnostic concepts matter to a diabetic retinopathy model, helping medical experts judge when to trust its predictions.
TCAV results are validated with statistical significance testing: CAVs are retrained against many different sets of random counterexamples, and concepts whose TCAV scores cannot be distinguished from those of random CAVs are rejected, ensuring the reported sensitivities are not due to chance. The method is also applied to adversarial examples, showing that TCAV scores can differentiate regular from adversarial inputs. Additionally, the paper compares TCAV with saliency maps, a common interpretation tool, and finds in a human-subject experiment that saliency maps alone can fail to communicate which concepts matter, whereas TCAV gives an explicit, quantitative answer.
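As a rough illustration of that testing procedure, the sketch below reuses `learn_cav` and `tcav_score` from the earlier snippet: it trains one CAV per random counterexample set, builds baseline CAVs from pairs of random sets, and compares the two distributions of TCAV scores with a two-sided t-test. The function name `tcav_with_significance` and the pairing scheme for the baseline are assumptions for illustration; the released TCAV code may differ in such details (e.g., the number of runs or a multiple-comparison correction).

```python
# Rough sketch of the significance test, reusing learn_cav and
# tcav_score from the previous snippet. Details such as the number of
# runs and any multiple-comparison correction may differ from the
# released TCAV code.
import numpy as np
from scipy import stats


def tcav_with_significance(concept_acts, random_acts_sets, class_grads, alpha=0.05):
    """Compare concept TCAV scores against scores from 'random' CAVs.

    random_acts_sets: list of (n_i, d) activation arrays, one per set
    of random counterexample inputs.
    """
    # One CAV per random counterexample set, all for the same concept.
    concept_scores = [
        tcav_score(class_grads, learn_cav(concept_acts, rnd))
        for rnd in random_acts_sets
    ]
    # Baseline: CAVs trained to separate one random set from another.
    random_scores = [
        tcav_score(class_grads, learn_cav(a, b))
        for a, b in zip(random_acts_sets[:-1], random_acts_sets[1:])
    ]
    # Two-sided t-test: is the concept's score distribution
    # distinguishable from the random baseline?
    _, p_value = stats.ttest_ind(concept_scores, random_scores)
    return float(np.mean(concept_scores)), float(p_value), bool(p_value < alpha)
```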
The paper concludes that TCAV provides a powerful tool for interpreting deep learning models, enabling users to understand how models make decisions in terms of human-friendly concepts. Future work includes applying TCAV to other types of data and exploring its potential in areas beyond interpretability, such as identifying adversarial examples.