Decomposing and Editing Predictions by Modeling Model Computation

17 Apr 2024 | Harshay Shah, Andrew Ilyas, Aleksander Madry
This paper introduces a framework for decomposing and editing predictions by modeling model computation. The goal is to understand how individual components of a machine learning model contribute to its predictions. We introduce the task of component modeling, which aims to estimate the counterfactual impact of individual components on a given prediction. We then present COAR, a scalable algorithm for estimating component attributions, and demonstrate its effectiveness across models, datasets, and modalities. Finally, we show that component attributions estimated with COAR directly enable model editing across five tasks: fixing model errors, "forgetting" specific classes, boosting subpopulation robustness, localizing backdoor attacks, and improving robustness to typographic attacks. We provide code for COAR at https://github.com/MadryLab/modelcomponents.
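To make the idea of estimating the counterfactual impact of components concrete, here is a minimal sketch in Python. It is not the authors' implementation (see the repository above for that). It assumes a toy one-hidden-layer network whose hidden units play the role of components, ablation by zeroing a unit, and an ordinary least-squares fit from random ablation masks to model outputs. The names `model_output`, `estimate_attributions`, and `keep_prob` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model" (an assumption for illustration): one hidden layer of ReLU units.
# Each hidden unit is treated as one component.
n_inputs, n_components = 8, 16
W = rng.normal(size=(n_components, n_inputs))
v = rng.normal(size=n_components)


def model_output(x, mask):
    """Scalar output for input x; components with mask == 0 are ablated (zeroed)."""
    hidden = np.maximum(W @ x, 0.0)
    return float(v @ (hidden * mask))


def estimate_attributions(x, n_samples=200, keep_prob=0.9):
    """Fit a linear surrogate mapping ablation masks to model outputs.

    Returns one coefficient per component plus an intercept. The coefficient
    estimates how much the output changes when that component is kept
    rather than ablated.
    """
    # Sample random ablation masks: 1 = component kept, 0 = component ablated.
    masks = (rng.random((n_samples, n_components)) < keep_prob).astype(float)
    outputs = np.array([model_output(x, m) for m in masks])
    # Append a column of ones so the regression also fits an intercept.
    design = np.hstack([masks, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(design, outputs, rcond=None)
    return coef[:-1], coef[-1]


x = rng.normal(size=n_inputs)
attributions, intercept = estimate_attributions(x)

# Check the surrogate on fresh, unseen ablation masks.
test_masks = (rng.random((100, n_components)) < 0.9).astype(float)
predicted = test_masks @ attributions + intercept
actual = np.array([model_output(x, m) for m in test_masks])
correlation = np.corrcoef(predicted, actual)[0, 1]
print(f"correlation between predicted and actual outputs: {correlation:.3f}")
```

In this toy model the output is exactly linear in the mask, so the fitted coefficients match each unit's true contribution. In a deep network the effect of ablations is generally not linear, and the quality of such a surrogate would need to be checked on held-out ablations, as in the last step above.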