26 Oct 2024 | Sonia Laguna, Ričards Marcinkevičs, Moritz Vandenhirtz, Julia E. Vogt
Beyond Concept Bottleneck Models: How to Make Black Boxes Intervenable?
Sonia Laguna, Ričards Marcinkevičs, Moritz Vandenhirtz, Julia E. Vogt
Abstract: This paper introduces a method to perform concept-based interventions on pretrained neural networks, which are not interpretable by design, given only a small validation set with concept labels. We formalise the notion of intervenability as a measure of the effectiveness of concept-based interventions and leverage this definition to fine-tune black boxes. Empirically, we explore the intervenability of black-box classifiers on synthetic tabular and natural image benchmarks. We focus on backbone architectures of varying complexity, from simple, fully connected neural nets to Stable Diffusion. We demonstrate that the proposed fine-tuning improves intervention effectiveness and often yields better-calibrated predictions. To showcase the practical utility of our techniques, we apply them to deep chest X-ray classifiers and show that fine-tuned black boxes are more intervenable than CBMs. Lastly, we establish that our methods are still effective under vision-language-model-based concept annotations, alleviating the need for a human-annotated validation set.
Introduction: Interpretable and explainable machine learning has seen renewed interest in concept-based predictive models and post hoc explanation techniques. This work focuses on concept bottleneck models (CBMs), which allow for human-model interaction by enabling users to intervene on predicted concept values. We focus on instance-specific interventions, i.e., performed individually for each data point. To this end, we explore two questions: (i) Given a small validation set with concept labels, how can we perform instance-specific interventions directly on a pretrained black-box model? (ii) How can we fine-tune the black-box model to improve the effectiveness of interventions performed on it?
Our contributions include: (1) A simple procedure to perform concept-based instance-specific interventions on a pretrained black-box neural network by editing its activations at an intermediate layer. (2) Formalising intervenability as a measure of the effectiveness of interventions performed on the model. (3) Evaluating the proposed procedures on synthetic tabular, natural image, and medical imaging data, demonstrating that concept-based interventions improve the predictive performance of pretrained black-box models. We also show that our methods are effective on datasets where concept labels are acquired using vision-language models (VLMs), alleviating the need for human annotation.
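The intervention procedure in contribution (1) can be illustrated with a minimal NumPy sketch: a linear probe (assumed fit on the small labelled validation set) maps intermediate activations to concepts, and an intervention edits the activations by gradient descent so that the probed concepts match user-specified values while staying close to the original activations. All names and hyperparameters here (`W_q`, `lam`, `lr`) are illustrative assumptions, not the authors' exact formulation.

```python
# Sketch of a concept-based intervention on a black box f = h(g(x)):
# edit the activations z = g(x) so a concept probe q(z) matches edited
# concept values, then predict with the unchanged head h.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy "pretrained" black box: backbone g(x) = x @ W_g, head h(z) = z @ w_h.
W_g = rng.normal(size=(4, 8))
w_h = rng.normal(size=8)

# Linear concept probe, assumed fit on the labelled validation set
# (3 hypothetical binary concepts).
W_q = rng.normal(size=(8, 3))

def intervene(z0, c_edit, lam=0.1, lr=0.1, steps=300):
    """Find z' near z0 whose probed concepts match c_edit by gradient
    descent on BCE(sigmoid(z' @ W_q), c_edit) + lam * ||z' - z0||^2."""
    z = z0.copy()
    for _ in range(steps):
        p = sigmoid(z @ W_q)
        # Gradient of the BCE term plus the proximity penalty.
        grad = (p - c_edit) @ W_q.T + 2.0 * lam * (z - z0)
        z -= lr * grad
    return z

x = rng.normal(size=4)
z0 = x @ W_g
c_edit = np.array([1.0, 0.0, 1.0])  # user-specified concept values
z_new = intervene(z0, c_edit)

print("probed concepts after intervention:",
      np.round(sigmoid(z_new @ W_q), 2))
print("prediction before / after:",
      float(sigmoid(z0 @ w_h)), float(sigmoid(z_new @ w_h)))
```

The proximity weight `lam` trades off how faithfully the edited activations realise the requested concepts against how far they drift from what the backbone actually produced; the head `h` is never retrained, which is what makes this applicable to a frozen pretrained model.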
Related Work: The use of high-level attributes in predictive models has been well-explored in computer vision. Recent efforts have focused on explicitly incorporating concepts in neural networks, producing high-level post hoc explanations by quantifying the network's sensitivity to the attributes. Other works have studied the use of auxiliary external attributes in out-of-distribution settings. To alleviate the assumption of being given interpretable concepts, some have explored concept discovery prior to post hoc explanation. Another relevant line of work investigated concept-based counterfactual explanations.
Methods: We define a measure for the effectiveness of concept-based interventions