[slides] Decomposing and Editing Predictions by Modeling Model Computation

This paper introduces a framework for decomposing and editing predictions by modeling model computation. The goal is to understand how individual components of a machine learning model contribute to its predictions. We introduce the task of component modeling, which aims to estimate the counterfactual impact of individual components on a given prediction. We then present COAR, a scalable algorithm for estimating component attributions, and demonstrate its effectiveness across models, datasets, and modalities. Finally, we show that component attributions estimated with COAR directly enable model editing across five tasks: fixing model errors, "forgetting" specific classes, boosting subpopulation robustness, localizing backdoor attacks, and improving robustness to typographic attacks. We provide code for COAR at https://github.com/MadryLab/modelcomponents.
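
To make the idea of a component attribution concrete, the toy sketch below estimates the counterfactual effect of each component on one output. It assumes attributions are fit by linear regression on randomly sampled ablation masks. The abstract does not spell out this procedure, and every name here (`toy_model_output`, `estimate_attributions`, `keep_prob`) is hypothetical and is not the API of the COAR repository. The "model" is a stand-in whose output is a weighted sum of four components.

```python
import numpy as np

def toy_model_output(weights, mask):
    """Hypothetical stand-in for a model's output on one example.

    mask[i] == 1 keeps component i; mask[i] == 0 ablates it (zeroes it out).
    """
    return float(np.dot(weights, mask))

def estimate_attributions(output_fn, n_components, n_samples=500,
                          keep_prob=0.9, seed=0):
    """Fit a linear model from ablation masks to outputs (an assumed procedure).

    Returns one coefficient per component plus an intercept.
    """
    rng = np.random.default_rng(seed)
    # Each row is a random ablation: every component is kept with prob keep_prob.
    masks = (rng.random((n_samples, n_components)) < keep_prob).astype(float)
    outputs = np.array([output_fn(m) for m in masks])
    # Append a column of ones so least squares also fits an intercept.
    design = np.hstack([masks, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(design, outputs, rcond=None)
    return coef[:-1], coef[-1]

# Illustrative component effects; component 2 has no effect on the output.
true_weights = np.array([2.0, -1.0, 0.0, 0.5])
attributions, intercept = estimate_attributions(
    lambda m: toy_model_output(true_weights, m), n_components=4
)
print(np.round(attributions, 3))
```

Because this toy output is exactly linear in the mask, the fitted coefficients match the true per-component effects. For a real network the linear fit would only approximate the effect of ablations, and a component with a large negative or positive coefficient would be a natural candidate for editing.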

17 Apr 2024 | Harshay Shah, Andrew Ilyas, Aleksander Madry